Bots
NOTE: 2025-07-23 THIS PAGE IS CURRENTLY UNDER CONSTRUCTION
The GLAM (Galleries, Libraries, Archives and Museums) sector has seen increased activity that appears to be automated, but doesn't identify itself as a 'robot' and doesn't follow robots.txt rules. The traffic makes repeated requests to systems (including EPrints repositories) for the same search terms, with the requests coming from a wide spread of IP addresses. Often an IP address will only make a single request, so traditional approaches that block traffic based on the IP address may not be enough.
Other attributes of these requests, e.g. the User-Agent, vary from request to request and don't help to identify which requests belong to the same 'swarm'.
The problematic traffic consists of requests to the search interface, which can cause performance issues with the platform. This isn't just an EPrints issue; other platforms experience the same problems.
Below are some approaches that may help to limit the impact of these 'swarms' of bots.
Analysing traffic
- if system logs are in an analysis platform, e.g. Splunk
- TODO
Thanks to members of the EP-Tech mailing list for some of the suggestions below.
Apache logs
TODO (detail from tech-list)
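Until the detail above is filled in, the pattern described in the introduction (the same search repeated from many different IP addresses) can be looked for with a short script. The following is only a minimal sketch, assuming the access log uses the Apache 'combined' format and that searches are served under /cgi/search. The helper name summarise_search_requests is hypothetical, and the sample lines are made up, using IP ranges reserved for documentation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of EPrints): count search requests and
# distinct client IPs per query string in Apache 'combined' format log lines.
sub summarise_search_requests
{
	my( @lines ) = @_;

	my %summary;
	foreach my $line ( @lines )
	{
		# Combined format begins: IP ident user [timestamp] "METHOD URL PROTOCOL"
		next unless $line =~ m{^(\S+) \S+ \S+ \[[^\]]*\] "[A-Z]+ (\S+) };
		my( $ip, $url ) = ( $1, $2 );

		# Only search requests that carry a query string are of interest
		next unless $url =~ m{/cgi/search[^?]*\?(.+)};
		my $query = $1;

		$summary{$query}{hits}++;
		$summary{$query}{ips}{$ip} = 1;
	}
	return \%summary;
}

# Made-up sample lines (assumption: real input would come from the access log)
my @sample = (
	'192.0.2.10 - - [23/Jul/2025:10:00:01 +0100] "GET /cgi/search/simple?q=example HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
	'198.51.100.7 - - [23/Jul/2025:10:00:02 +0100] "GET /cgi/search/simple?q=example HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
	'203.0.113.99 - - [23/Jul/2025:10:00:03 +0100] "GET /cgi/search/simple?q=example HTTP/1.1" 200 5120 "-" "curl/8.0"',
	'192.0.2.10 - - [23/Jul/2025:10:00:04 +0100] "GET /view/year/2024.html HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
);

my $summary = summarise_search_requests( @sample );

# Most requested searches first
foreach my $query ( sort { $summary->{$b}{hits} <=> $summary->{$a}{hits} } keys %$summary )
{
	printf "%d hits from %d IPs: %s\n",
		$summary->{$query}{hits},
		scalar keys %{ $summary->{$query}{ips} },
		$query;
}
```

A query string with many hits and roughly as many distinct IPs is a candidate 'swarm' search. Grouping on the whole query string is a simplification: requests that differ only in parameter order or in paging parameters will be counted separately.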
Cachemap
When the internal EPrints search is used, a cache table is created in the database, and the details of the search are stored in the core cachemap table.
The pattern of 'swarm' activity can result in many identical searches being run. We can see these using the following query:
SET @threshold = 30;
SELECT COUNT(*) c, searchexp FROM cachemap GROUP BY searchexp HAVING c > @threshold ORDER BY c;
Analysis of the web logs using search expressions returned from above may identify some attributes of the 'swarm'.
NB If EPrints is using Xapian to process searches, the majority of searches will not create a cache table.
Blocking abusive search traffic
Firewall
???
fail2ban
??? - scanning logs for repeated cache=[x]
Apache configuration - mod_security (WAF)
# details from DRN 2025-07-23
Approach: EPrints configuration to block specific searches
The example below creates an EPrints trigger that is active when a request is being processed by EPrints. Any incoming request whose query string contains one of the terms in the $bad_search configuration will not run a search, but will be presented with a '429 - too many requests' page.
NB the core Apache2::Const module does not include a constant for a 429 response, so a numeric value is used instead of e.g. OK or FORBIDDEN
# save in a cfg.d dir somewhere e.g. [EPRINTS_ROOT]/archives/[ARCHIVE_ID]/cfg/cfg.d/a_BOT_BLOCK.pl
use EPrints::Const;

### UPDATE THESE WITH SEARCHES YOU WANT TO BLOCK!
# Each term is escaped with quotemeta, then the terms are joined as regex alternatives.
my $bad_search = join "|", map quotemeta, qw{
	IN:Habits|
	%3AHabits%7C
	ZEPLIN+COSINE+DRIFT+ADMX+LIGO+Kamiokande+SBND
};
$c->{blocked_search_terms_re} = qr/$bad_search/;

$c->add_trigger( EP_TRIGGER_URL_REWRITE, sub {
	my( %args ) = @_;
	# args passed are: request, lang, args, urlpath, cgipath, uri, secure, return_code
	my( $repository, $request, $return_code, $uri, $urlpath ) = @args{ qw( repository request return_code uri urlpath ) };

	# Just interested in searches
	if( $uri =~ /^$urlpath\/cgi\/search/ )
	{
		# The raw (still URL-encoded) query string
		my $r_args = $request->args();
		if( defined $r_args )
		{
			if( $r_args =~ /$c->{blocked_search_terms_re}/ )
			{
				# NB Apache2::Const doesn't define 429.
				$request->custom_response( 429, $c->{bot_429_page_html} );
				${$return_code} = 429;
				return EP_TRIGGER_DONE;
			}
		}
	}
});
$c->{bot_429_page_html} = '<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<style>
body {
font-family: sans-serif;
margin: 3em;
}
footer {
font-size: 80%;
margin-top:2em;
}
</style>
<title>Rate Limited</title>
</head>
<body>
<header>
<h1>429 Too Many Requests</h1>
</header>
<section>
<p>This search has been blocked due to abuse by automated activity.</p>
</section>
<footer>
<p>White Rose Libraries</p>
</footer>
</body>
</html>';
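Before deploying a list of blocked terms, it can be useful to check which query strings it would actually match. The following is a minimal standalone sketch, not part of the configuration above: it builds the pattern the same way (quotemeta on each term, joined with "|") and tests it against some hypothetical query strings. The helper name is_blocked and the sample query strings are assumptions for illustration.

```perl
use strict;
use warnings;

# Same construction as the configuration: escape each term so characters
# such as '+', '|' and '%' are matched literally, then join as alternatives.
my $bad_search = join "|", map { quotemeta($_) } qw{
	IN:Habits|
	%3AHabits%7C
	ZEPLIN+COSINE+DRIFT+ADMX+LIGO+Kamiokande+SBND
};
my $blocked_re = qr/$bad_search/;

# Hypothetical helper: true if the raw query string contains a blocked term
sub is_blocked
{
	my( $query_string ) = @_;
	return $query_string =~ $blocked_re ? 1 : 0;
}

# Hypothetical query strings, as they would appear un-decoded in $request->args()
my @examples = (
	'q=IN%3AHabits%7C&_action_search=Search',
	'q=ZEPLIN+COSINE+DRIFT+ADMX+LIGO+Kamiokande+SBND',
	'q=open+access',
);

foreach my $query_string ( @examples )
{
	printf "%-50s %s\n", $query_string, is_blocked( $query_string ) ? 'blocked' : 'allowed';
}
```

Because the match is made against the raw query string, a term has to be listed in the form in which it appears in the request (for example %3A rather than ':'), which is why the list above carries both an encoded and an unencoded variant of the same search.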
Using 3rd party tools
- Cloudflare
- Anubis
Related resources
- Code4Lib Blocking Bots - there is also a useful channel on the Code4Lib slack
- Are AI Bots Knocking Cultural Heritage Offline? GLAM-E Lab report