Download Metrics

From EPrints Documentation
Revision as of 13:13, 25 May 2007 by Timbrody (talk | contribs)

Download Metrics in EPrints

Capturing Usage

EPrints uses the Apache web server and mod_perl (an accelerator for Perl CGI scripts). mod_perl provides hooks into the logging and control features of the Apache server, one of which is the access logging function. A log handler is registered with Apache using the PerlLogHandler directive, e.g.

PerlLogHandler EPrints::Apache::LogHandler

This handler is called after a response has been generated by the web server (i.e. in response to requests coming in from web clients).

In the initial implementation we are only interested in requests to eprint objects. This ignores requests to static content (e.g. the home page), searches and user-specific content (e.g. depositing processes). This is because our current requirements are to capture usage to determine download metrics for objects contained in a repository, rather than analysing how a repository is used.

For the remainder of this section we will use the following example request (as you would otherwise see in an Apache access log file):

127.0.0.1 - - [25/May/2007:13:07:24 +0100] "GET /12614/01/Semantic_Web_Revisted.pdf HTTP/1.1" 200 130951 "http://www.w3.org/2001/sw/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

The pertinent parts of this entry are:

Fields in an HTTP request.
field                                    example
requesting host                          127.0.0.1
date/time of request (by server time)    25/May/2007:13:07:24 +0100
page requested                           /12614/01/Semantic_Web_Revisted.pdf
HTTP response code (200 is 'ok')         200
referring web page                       http://www.w3.org/2001/sw/
user's web browser (or web crawler id)   Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
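
As an illustration of how these fields map onto the example entry, the following is a minimal Python sketch that splits a log line in Apache's combined format into named parts. It is not the EPrints log handler, which runs inside mod_perl and reads these values from the Apache request object; the names COMBINED and parse_log_line are hypothetical.

```python
import re
from datetime import datetime

# Apache "combined" log format:
# host identity user [time] "request line" status bytes "referrer" "user agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["status"] = int(fields["status"])
    # Server-local time with a numeric UTC offset, e.g. "25/May/2007:13:07:24 +0100"
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

EXAMPLE = (
    '127.0.0.1 - - [25/May/2007:13:07:24 +0100] '
    '"GET /12614/01/Semantic_Web_Revisted.pdf HTTP/1.1" 200 130951 '
    '"http://www.w3.org/2001/sw/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)

entry = parse_log_line(EXAMPLE)
print(entry["host"], entry["status"], entry["path"], entry["referrer"])
```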

The EPrints log handler ignores any request whose HTTP response code is not 200, i.e. any error or redirect. The remaining requests are split into three sets, based on the page requested field: non-eprint requests, eprint abstract requests and full-text requests.

Parts of a page request URL.
/12614/              /01/                   Semantic_Web_Revisted.pdf
eprint identifier    document identifier    file requested

Requests without an eprint identifier part are ignored. Requests with no document identifier part are defined as abstract requests. Requests with both eprint and document identifier parts are defined as full-text requests.
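
The three-way split described above can be sketched in Python as follows. This is an illustration only, assuming that both identifiers are purely numeric path segments as in the example URL; the function name classify_request and the labels it returns are hypothetical and are not taken from the EPrints source.

```python
def classify_request(path):
    """Classify a requested path as 'other', 'abstract' or 'fulltext'.

    Returns (kind, eprint_id, document_id); the ids are None when absent.
    """
    path = path.split("?", 1)[0]               # drop any query string
    parts = [p for p in path.split("/") if p]  # non-empty path segments

    # No leading numeric segment: not a request to an eprint object.
    if not parts or not parts[0].isdecimal():
        return ("other", None, None)
    eprint_id = int(parts[0])

    # Eprint identifier followed by a document identifier: full-text request.
    if len(parts) >= 2 and parts[1].isdecimal():
        return ("fulltext", eprint_id, int(parts[1]))

    # Eprint identifier alone: abstract page request.
    if len(parts) == 1:
        return ("abstract", eprint_id, None)

    return ("other", None, None)

print(classify_request("/12614/01/Semantic_Web_Revisted.pdf"))
print(classify_request("/12614/"))
print(classify_request("/cgi/search"))
```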

A simple filter is implemented in the capture code to handle requests to multi-file documents (e.g. HTML pages with inlined images). Any full-text requests where the referring web page (see above) is also a request to the same full-text are ignored. Assuming the referring web page is correctly supplied by the web client this will count requests to any part of a document, but ignore multiple requests resulting from complex media.
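
A minimal Python sketch of such a referrer filter is given below. It assumes that "the same full-text" means the same eprint and document identifier pair, and that the referrer must point at the repository's own host; the second condition is an assumption of this sketch, not something stated above. The names document_key and is_repeat_request, and the host eprints.example.org, are hypothetical.

```python
from urllib.parse import urlsplit

def document_key(path):
    """Return (eprint_id, document_id) for a full-text path, else None."""
    parts = [p for p in path.split("/") if p]
    if len(parts) >= 2 and parts[0].isdecimal() and parts[1].isdecimal():
        return (int(parts[0]), int(parts[1]))
    return None

def is_repeat_request(path, referrer, repository_host):
    """True if the referrer is a page within the same document as path."""
    key = document_key(path)
    if key is None or not referrer:
        return False
    ref = urlsplit(referrer)
    # Assumption: only referrers on the repository's own host can match.
    if ref.hostname != repository_host:
        return False
    return document_key(ref.path) == key

# An image inlined in an HTML document: ignored as a repeat request.
print(is_repeat_request(
    "/12614/01/figure1.png",
    "http://eprints.example.org/12614/01/index.html",
    "eprints.example.org",
))
# The example request, referred from an external page: counted.
print(is_repeat_request(
    "/12614/01/Semantic_Web_Revisted.pdf",
    "http://www.w3.org/2001/sw/",
    "eprints.example.org",
))
```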

Storing Usage

Requests are stored in the EPrints database in the access_log table; see [Access_Log] for a description of the fields. This is almost a mirror of the normal Apache log file, but storing the data in a table makes it much simpler to expose and re-use the data.

Log data can potentially get very large, which may require a different technical approach in future (or periodic triaging of the access log data).

Exposing Usage

EPrints 3 includes a script that can expose access logs through an OAI-PMH interface. This interface is located at /cgi/oai_accesslogs. The script contains an EPrints::abort at the top that prevents all access. This is because usage logs are potentially sensitive data: a repository will want to implement some kind of access restriction before providing access to its logs.

The OAI interface uses EPrints export plugins to convert access log objects (as read from the EPrints database) to XML suitable for embedding in the OAI output. From an EPrints architecture perspective, the mechanism for exposing access log records is nearly identical to that for eprint records (both are datasets in EPrints).

accesslog/eprint OAI comparison
                  eprint                                      accesslog
identifier        oai:<archive name>:<item id>                oai:<archive name>:<item id>
export plugins    (predefined list in cfg.d/oai.pl)           can_accept => "list/access"
from/until        last_mod field                              datestamp field
OAI sets          arbitrary fields defined in cfg.d/oai.pl    not supported

An OAI ListRecords request may contain date and set filters, although sets aren't yet supported for access logs. Using the EPrints search API, a result set of records is generated that matches the OAI arguments, e.g. all access log records made after a certain time. Each record is converted to XML by the export plugin requested (via a mapping of the metadataPrefix OAI argument to a plugin identifier) and output as an XML document.
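
For illustration, a harvester could build such a ListRecords request as in the Python sketch below. The verb, metadataPrefix, from and until arguments are standard OAI-PMH; the host eprints.example.org and the metadataPrefix value context_object are assumptions made for this example, so check the repository's ListMetadataFormats response for the real prefix.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix, from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords URL with optional date filters."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return base_url + "?" + urlencode(params)

# Hypothetical repository host and metadata prefix.
url = list_records_url(
    "http://eprints.example.org/cgi/oai_accesslogs",
    "context_object",
    from_date="2007-05-25",
)
print(url)
```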

Currently EPrints supports exporting access log records in OpenURL ContextObject format. To be OAI-compliant, an interface should also support Dublin Core (but access logs don't make a lot of sense in DC). The following is our example request, exposed as an XML ContextObject in an OAI response:

<ctx:context-object timestamp="2007-05-25T13:07:19Z">
  <ctx:referent>
    <ctx:identifier>oai:eprints.ecs.soton.ac.uk:12614</ctx:identifier>
  </ctx:referent>
  <ctx:referring-entity>
    <ctx:identifier>http://www.w3.org/2001/sw/</ctx:identifier>
  </ctx:referring-entity>
  <ctx:requester>
    <ctx:identifier>urn:ip:127.0.0.1</ctx:identifier>
    <ctx:private-data>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)</ctx:private-data>
  </ctx:requester>
  <ctx:service-type>
    <ctx:metadata-by-val>
      <ctx:format>info:ofi/fmt:xml:xsd:sch_svc</ctx:format>
      <sv:svc-list>
        <sv:fulltext>yes</sv:fulltext>
      </sv:svc-list>
    </ctx:metadata-by-val>
  </ctx:service-type>
</ctx:context-object>
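
A harvester might read such a record back as in the Python sketch below. The fragment above omits namespace declarations (in a full OAI response they would normally be declared on the element itself or on an enclosing one), so the sample here adds them. The two namespace URIs are assumed from the OpenURL XML schema naming and should be checked against real output; the function name summarise is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the ctx: and sv: prefixes.
NS = {
    "ctx": "info:ofi/fmt:xml:xsd:ctx",
    "sv": "info:ofi/fmt:xml:xsd:sch_svc",
}

SAMPLE = """\
<ctx:context-object xmlns:ctx="info:ofi/fmt:xml:xsd:ctx"
                    xmlns:sv="info:ofi/fmt:xml:xsd:sch_svc"
                    timestamp="2007-05-25T13:07:19Z">
  <ctx:referent>
    <ctx:identifier>oai:eprints.ecs.soton.ac.uk:12614</ctx:identifier>
  </ctx:referent>
  <ctx:referring-entity>
    <ctx:identifier>http://www.w3.org/2001/sw/</ctx:identifier>
  </ctx:referring-entity>
  <ctx:requester>
    <ctx:identifier>urn:ip:127.0.0.1</ctx:identifier>
  </ctx:requester>
  <ctx:service-type>
    <ctx:metadata-by-val>
      <ctx:format>info:ofi/fmt:xml:xsd:sch_svc</ctx:format>
      <sv:svc-list>
        <sv:fulltext>yes</sv:fulltext>
      </sv:svc-list>
    </ctx:metadata-by-val>
  </ctx:service-type>
</ctx:context-object>
"""

def summarise(xml_text):
    """Extract the main fields of one ContextObject into a dict."""
    root = ET.fromstring(xml_text)

    def text(path):
        node = root.find(path, NS)
        return node.text if node is not None else None

    fulltext = text("ctx:service-type/ctx:metadata-by-val/sv:svc-list/sv:fulltext")
    return {
        "timestamp": root.get("timestamp"),
        "referent": text("ctx:referent/ctx:identifier"),
        "referrer": text("ctx:referring-entity/ctx:identifier"),
        "requester": text("ctx:requester/ctx:identifier"),
        "fulltext": fulltext == "yes",
    }

summary = summarise(SAMPLE)
print(summary)
```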

Analysing Usage

TODO: epstats