Web crawling robots are a fact of life. Many are out there on the web, and many do a good job of indexing our content.
However, an increasing number of robots are causing problems for repository owners.
These robots place unnecessary load on repository servers and skew the download statistics for the published data.
We at EPrints Services and IRUS have observed a number of harmful robots which can be identified either by their IP address or their user agent.
We are working to produce and maintain a simple list of these robots, so that repository systems administrators can more easily filter or block them.
The first version of this list can be found below.
IRUS are currently using entries in this file to improve their reports.
EPrints Services are rolling out a version of IRStats2 which will filter out accesses from this list, and, for hosted services, are blocking accesses at the firewall level.
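As a rough illustration of how such a list could be applied, the sketch below checks a request against blocklist entries keyed by user agent substring or by IP range. The entry format, names, and example addresses here are assumptions for illustration only; the actual list and the filtering done by IRUS or IRStats2 may differ.

```python
# Hypothetical sketch: match a repository request against a robot blocklist.
# The two-column entry format ("agent" substring or "ip" CIDR range) is an
# assumption for illustration, not the format of the published list.
from ipaddress import ip_address, ip_network

BLOCKLIST = [
    ("agent", "BadBot"),       # block by user-agent substring
    ("ip", "192.0.2.0/24"),    # block by IP address or CIDR range
]

def is_bad_robot(ip: str, user_agent: str) -> bool:
    """Return True if the request matches any blocklist entry."""
    for kind, value in BLOCKLIST:
        if kind == "agent" and value.lower() in user_agent.lower():
            return True
        if kind == "ip" and ip_address(ip) in ip_network(value):
            return True
    return False
```

A repository could run such a check when recording a download, discarding matching hits before they reach the statistics tables; blocking at the firewall level additionally spares the server the load of serving the request at all.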