Download Metrics

From EPrints Documentation
Revision as of 13:55, 8 February 2010 by Pm705 (Talk | contribs)


Download Metrics in EPrints

Capturing Usage

EPrints uses the Apache web server and mod_perl, which embeds a Perl interpreter in Apache and so speeds up Perl CGI scripts. mod_perl also provides hooks into the logging and control features of the Apache server, one of which is the access logging function. A log handler is registered with Apache using the PerlLogHandler directive, e.g.

PerlLogHandler EPrints::Apache::LogHandler

This handler is called after a response has been generated by the web server (i.e. in response to requests coming in from web clients).

In the initial implementation we are only interested in requests to eprint objects. Requests to static content (e.g. the home page), searches and user-specific content (e.g. the deposit process) are ignored. This is because our current requirement is to capture usage in order to determine download metrics for the objects contained in a repository, rather than to analyse how a repository is used.

For the remainder of this section we will use the following example request (as you would otherwise see in an Apache access log file):

127.0.0.1 - - [25/May/2007:13:07:24 +0100] "GET /12614/01/Semantic_Web_Revisted.pdf HTTP/1.1" 200 130951 "http://www.w3.org/2001/sw/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

The pertinent parts of this entry are:

Fields in an HTTP request:

Field                                     Example
requesting host                           127.0.0.1
date/time of request (by server time)     25/May/2007:13:07:24 +0100
page requested                            /12614/01/Semantic_Web_Revisted.pdf
HTTP response code (200 is 'ok')          200
referring web page                        http://www.w3.org/2001/sw/
user's web browser (or web crawler id)    Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
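
As an illustration only (the EPrints log handler itself is Perl and reads these values from Apache rather than from a log file), the fields above can be pulled out of the example entry with a short Python sketch. The regular expression and the name parse_log_line are assumptions made for this example, not part of EPrints.

```python
import re
from datetime import datetime

# Apache "combined" log format: host, identity, user, [time], "request",
# status, bytes, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the fields of one combined-format log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Server time, including the UTC offset, e.g. 25/May/2007:13:07:24 +0100.
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields


EXAMPLE = (
    '127.0.0.1 - - [25/May/2007:13:07:24 +0100] '
    '"GET /12614/01/Semantic_Web_Revisted.pdf HTTP/1.1" 200 130951 '
    '"http://www.w3.org/2001/sw/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)

entry = parse_log_line(EXAMPLE)
print(entry["host"], entry["status"], entry["path"], entry["referrer"])
```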

The EPrints log handler ignores any request whose HTTP response code is not 200, i.e. any errors or redirects. A simple heuristic then separates the remaining requests into three potential types: non-eprint requests, eprint abstract requests and full-text requests. The heuristic is based on the page requested field:

Parts of a page request URL:

eprint identifier    document identifier    file requested
/12614/              /01/                   Semantic_Web_Revisted.pdf

Requests without an eprint identifier part are ignored. Requests with no document identifier part are defined as abstract requests. Requests with both eprint and document identifier parts are defined as full-text requests.
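
The heuristic can be sketched in Python as follows. This is not the EPrints implementation: the function name classify_request, the returned labels, and the assumption that eprint and document identifiers are purely numeric path segments are all made up for illustration.

```python
import re

# Assumption: eprint and document identifiers are purely numeric path segments.
NUMERIC = re.compile(r"[0-9]+")


def classify_request(status, path):
    """Classify a request as 'ignored', 'non-eprint', 'abstract' or 'full-text'."""
    if status != 200:
        # Errors and redirects are not counted at all.
        return "ignored"
    # Drop any query string, then split the path into non-empty segments.
    parts = [p for p in path.split("?", 1)[0].split("/") if p]
    if not parts or not NUMERIC.fullmatch(parts[0]):
        return "non-eprint"      # no eprint identifier part
    if len(parts) == 1:
        return "abstract"        # eprint identifier only
    if NUMERIC.fullmatch(parts[1]):
        return "full-text"       # eprint and document identifier
    # Assumption: anything else under an eprint id is treated as non-eprint.
    return "non-eprint"


print(classify_request(200, "/12614/01/Semantic_Web_Revisted.pdf"))
print(classify_request(200, "/12614/"))
print(classify_request(200, "/cgi/search"))
```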

A simple filter is implemented in the capture code to handle requests to multi-file documents (e.g. HTML pages with inlined images): any full-text request whose referring web page (see above) is itself a request to the same full-text document is ignored. Assuming the web client supplies the referring web page correctly, this counts requests to any part of a document but ignores the follow-on requests that result from complex media.
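
A minimal sketch of such a filter, again in Python rather than the Perl of the actual handler. The names document_key and is_countable_fulltext are hypothetical, and the sketch compares only the path of the referring page, ignoring its host, which is a simplifying assumption.

```python
import re
from urllib.parse import urlparse

NUMERIC = re.compile(r"[0-9]+")


def document_key(path):
    """Return (eprint id, document id) for a full-text path, otherwise None."""
    parts = [p for p in path.split("/") if p]
    if len(parts) >= 2 and NUMERIC.fullmatch(parts[0]) and NUMERIC.fullmatch(parts[1]):
        return parts[0], parts[1]
    return None


def is_countable_fulltext(path, referrer):
    """True if a full-text request should be counted as a new download."""
    key = document_key(path)
    if key is None:
        return False  # not a full-text request
    # Assumption: only the referrer's path is compared, not its host.
    referrer_path = urlparse(referrer).path if referrer else ""
    # Ignore the request if it was referred by the same document, e.g. an
    # image inlined in an HTML page of that document.
    return document_key(referrer_path) != key


print(is_countable_fulltext("/12614/01/Semantic_Web_Revisted.pdf",
                            "http://www.w3.org/2001/sw/"))
print(is_countable_fulltext("/12614/02/figure1.png",
                            "http://repository.example.org/12614/02/index.html"))
```

In this sketch the first call is counted, because the referring page is external, while the second is not, because the image was requested from a page of the same document. The host repository.example.org is a placeholder.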

Storing Usage Data

Requests are stored in the EPrints access dataset (see Access_Object for a description of the dataset's fields). This is a partially processed mirror of the normal Apache log file, but storing the data in a table makes it much simpler to expose and re-use.

The EPrints dataset API provides a unique identifier for every request, which is an incremented integer starting at 1. Because the requests form a dataset, they can be searched or iterated over using the EPrints API.

Exposing Usage

EPrints 3 includes a script that can expose access logs through an OAI-PMH interface, located at /cgi/oai_accesslogs. The script contains an EPrints::abort at the top that prevents all access. This is because usage logs are potentially sensitive data: a repository will want to implement some kind of access restriction before providing access to its logs.

The OAI interface uses EPrints export plugins to convert access log objects (as read from the EPrints database) into XML suitable for embedding in the OAI output. From an EPrints architecture perspective, the mechanism to expose access log records is nearly identical to the exposure of eprint records (they are both datasets in EPrints).

eprint/access OAI comparison:

                  eprint                                      access
identifier        oai:<archive name>:<item id>                oai:<archive name>:<item id>
export plugins    (predefined list in cfg.d/oai.pl)           can_accept => "list/access"
from/until        last_mod field                              datestamp field
OAI sets          arbitrary fields defined in cfg.d/oai.pl    not-supported

An OAI ListRecords request may contain date and set filters, although sets aren't yet supported for access logs. OAI date filters consist of optional from and until arguments that restrict the resulting data to only those accesses that occurred between the two dates.
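
The effect of the from and until arguments can be illustrated with a small Python sketch. It does not use the EPrints search API: the record layout (a dict with a timezone-aware datestamp) and the function names are assumptions. OAI-PMH datestamps are in UTC and both bounds are inclusive, so a day-granularity until value is taken here to cover the whole of that day.

```python
from datetime import datetime, timedelta, timezone


def parse_oai_date(value):
    """Parse an OAI-PMH UTC datestamp: YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ."""
    fmt = "%Y-%m-%dT%H:%M:%SZ" if "T" in value else "%Y-%m-%d"
    return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)


def in_range(datestamp, from_=None, until=None):
    """True if datestamp lies within the optional, inclusive from/until bounds."""
    if from_ is not None and datestamp < parse_oai_date(from_):
        return False
    if until is not None:
        limit = parse_oai_date(until)
        if "T" not in until:
            # Day granularity: include everything up to the end of that day.
            return datestamp < limit + timedelta(days=1)
        return datestamp <= limit
    return True


def select_accesses(records, from_=None, until=None):
    """Return the access records whose datestamp matches the OAI date filter."""
    return [r for r in records if in_range(r["datestamp"], from_, until)]


# Hypothetical access records; only the second corresponds to the example request.
records = [
    {"accessid": 1, "datestamp": datetime(2007, 5, 24, 23, 59, 59, tzinfo=timezone.utc)},
    {"accessid": 2, "datestamp": datetime(2007, 5, 25, 13, 7, 19, tzinfo=timezone.utc)},
    {"accessid": 3, "datestamp": datetime(2007, 5, 26, 0, 0, 0, tzinfo=timezone.utc)},
]

print([r["accessid"] for r in select_accesses(records, from_="2007-05-25")])
print([r["accessid"] for r in select_accesses(records, until="2007-05-25")])
```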

Using the EPrints search API, a result set of records is generated that matches the OAI arguments, e.g. all access log records made after a certain time. Each record is converted into XML using the requested export plugin (chosen by mapping the metadataPrefix OAI argument to a plugin identifier) and the result is output as an OAI response.

Currently EPrints supports exporting access log records in OpenURL ContextObject format. To be OAI-compliant an interface should also support Dublin Core (although access logs don't make a lot of sense in DC). The following is our example request, exposed as an XML ContextObject in an OAI response:

<ctx:context-object timestamp="2007-05-25T13:07:19Z">
  <ctx:referent>
    <ctx:identifier>oai:eprints.ecs.soton.ac.uk:12614</ctx:identifier>
  </ctx:referent>
  <ctx:referring-entity>
    <ctx:identifier>http://www.w3.org/2001/sw/</ctx:identifier>
  </ctx:referring-entity>
  <ctx:requester>
    <ctx:identifier>urn:ip:127.0.0.1</ctx:identifier>
    <ctx:private-data>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)</ctx:private-data>
  </ctx:requester>
  <ctx:service-type>
    <ctx:metadata-by-val>
      <ctx:format>info:ofi/fmt:xml:xsd:sch_svc</ctx:format>
      <sv:svc-list>
        <sv:fulltext>yes</sv:fulltext>
      </sv:svc-list>
    </ctx:metadata-by-val>
  </ctx:service-type>
</ctx:context-object>
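
To make the structure of the record above concrete, the following Python sketch builds an equivalent ContextObject with the standard library. It is not the EPrints export plugin. The example above omits its namespace declarations, so the ctx namespace URI used here is assumed from the OpenURL (Z39.88-2004) XML schemas, and the dictionary keys and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the sv URI matches the ctx:format value shown above.
CTX = "info:ofi/fmt:xml:xsd:ctx"
SV = "info:ofi/fmt:xml:xsd:sch_svc"
ET.register_namespace("ctx", CTX)
ET.register_namespace("sv", SV)


def ctx(tag):
    """Qualify a tag name with the ContextObject namespace."""
    return f"{{{CTX}}}{tag}"


def add_identified(parent, tag, identifier):
    """Append an entity element holding a single ctx:identifier child."""
    entity = ET.SubElement(parent, ctx(tag))
    ET.SubElement(entity, ctx("identifier")).text = identifier
    return entity


def build_context_object(access):
    """Build a ContextObject element for one full-text access record."""
    root = ET.Element(ctx("context-object"), {"timestamp": access["timestamp"]})
    add_identified(root, "referent", access["referent"])
    add_identified(root, "referring-entity", access["referrer"])
    requester = add_identified(root, "requester", "urn:ip:" + access["ip"])
    ET.SubElement(requester, ctx("private-data")).text = access["user_agent"]
    service = ET.SubElement(root, ctx("service-type"))
    by_val = ET.SubElement(service, ctx("metadata-by-val"))
    ET.SubElement(by_val, ctx("format")).text = SV
    svc_list = ET.SubElement(by_val, f"{{{SV}}}svc-list")
    ET.SubElement(svc_list, f"{{{SV}}}fulltext").text = "yes"
    return root


access = {
    "timestamp": "2007-05-25T13:07:19Z",
    "referent": "oai:eprints.ecs.soton.ac.uk:12614",
    "referrer": "http://www.w3.org/2001/sw/",
    "ip": "127.0.0.1",
    "user_agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
}

xml_text = ET.tostring(build_context_object(access), encoding="unicode")
print(xml_text)
```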

Analysing Usage

TODO: epstats