Access Log Dataset

From EPrints Documentation
Revision as of 13:17, 18 December 2006 by Timbrody (talk | contribs)

EPrints 3 introduces some new features for capturing and analysing usage of the repository.

Every time an abstract or full text is accessed, a record describing the access is written to the access dataset.

Access dataset fields

OpenURL terminology is used where an equivalent term in OpenURL is defined.

Field                   Description
accessid                A unique identifier for the access (sequential numbering).
datestamp               The time the access occurred in UTC.
requester_id            An identifier for the user that made the request.
requester_user_agent    The user agent string as given by the user's browser.
requester_country       The ISO country code for the user's location.
requester_institution   The user's institution or organisation.
referring_entity_id     The identifier of the object that the user followed a link from.
service_type_id         The type of service requested, either fulltext=yes or abstract=yes.
referent_id             The identifier of the eprint requested.
referent_docid          The document number requested (for fulltext requests).
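
As a rough illustration of the shape of one access record, the fields listed above could be modelled as follows. This is only a sketch: the class name AccessRecord, the Python types, and the choice of which fields are optional are assumptions made for the example, not the EPrints schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AccessRecord:
    """One access record; field names follow the list above (types are assumed)."""
    accessid: int                        # sequential unique identifier
    datestamp: datetime                  # time of the access, in UTC
    requester_id: str                    # identifier for the requesting user
    service_type_id: str                 # "fulltext=yes" or "abstract=yes"
    referent_id: str                     # identifier of the eprint requested
    requester_user_agent: Optional[str] = None
    requester_country: Optional[str] = None      # ISO country code
    requester_institution: Optional[str] = None
    referring_entity_id: Optional[str] = None
    referent_docid: Optional[int] = None         # fulltext requests only

    def is_fulltext(self) -> bool:
        """True when the record describes a full-text request."""
        return self.service_type_id == "fulltext=yes"


# Placeholder values for illustration only.
example = AccessRecord(
    accessid=1,
    datestamp=datetime(2006, 12, 18, 13, 17, tzinfo=timezone.utc),
    requester_id="uri:ip:192.0.2.1",
    service_type_id="fulltext=yes",
    referent_id="oai:example.org:42",
    referent_docid=1,
)
```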

requester_id

The requester_id field contains an identifier for the user that made the request. In the release version this consists of uri:ip: followed by the IP address of the user's connection. Depending on the network topology, this address may belong to an intermediary, e.g. a caching proxy server, rather than to the user's own machine.
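
A consumer of the access data might want to recover the IP address from a requester_id value. The following is a minimal sketch, assuming the prefix is exactly as written above; the function name requester_ip is hypothetical and not part of EPrints.

```python
import ipaddress

# Prefix as described in the text above.
PREFIX = "uri:ip:"


def requester_ip(requester_id):
    """Return the IP address held in a requester_id, or None if it does not fit."""
    if not requester_id.startswith(PREFIX):
        # e.g. a future cookie-based identifier
        return None
    try:
        return ipaddress.ip_address(requester_id[len(PREFIX):])
    except ValueError:
        # The prefix matched but the remainder is not a valid IP address.
        return None
```

Returning None for unrecognised values keeps the caller working if the field later carries another kind of identifier.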

In future the requester_id field may contain a unique cookie or similar mechanism for identifying users.

Requester country and institution

If available, the MaxMind GeoIP databases can be used to look up the country code and organisation of the user from their IP address.

referent_id

The referent - the object being requested - is stored using the full OAI identifier. This may be replaced with just the eprint number.

Filtering at the LogHandler Stage

Requests to the EPrints web server are captured using a mod_perl handler (EPrints::Apache::LogHandler), which is called on every request to the web server. The handler discards every request whose response status is not HTTP 200, so redirects and partial-content responses, for example, are ignored. Only requests to abstract and full-text URLs are recognised, i.e. requests to /xx/ and /xx/yy/... where xx is the eprint id and yy is the document id.
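
The filtering rule can be sketched in a few lines. This is not the LogHandler code (which is written in Perl); it is an illustrative Python version that assumes eprint and document ids are plain decimal numbers, and the function name classify_request is hypothetical.

```python
import re

# /xx/ -> abstract page; /xx/yy/... -> full-text document (ids assumed numeric)
ABSTRACT_RE = re.compile(r"^/(\d+)/$")
FULLTEXT_RE = re.compile(r"^/(\d+)/(\d+)/")


def classify_request(path, status):
    """Return (service, eprint_id, doc_id), or None if the request is not logged."""
    if status != 200:
        # Redirects, partial content, errors, etc. are filtered out.
        return None
    match = FULLTEXT_RE.match(path)
    if match:
        return ("fulltext", int(match.group(1)), int(match.group(2)))
    match = ABSTRACT_RE.match(path)
    if match:
        return ("abstract", int(match.group(1)), None)
    # Any other URL is not an abstract or full-text request.
    return None
```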

As a special case, any request for a full text whose referring entity is also a full-text request is ignored. The net result is that inline content, e.g. images and JavaScript in HTML documents, is not counted.
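
The special case can be illustrated as follows. The sketch assumes numeric ids and looks only at the path of the referrer, not its host; both are simplifications, and the function names are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Full-text URLs have the form /xx/yy/... (ids assumed numeric)
FULLTEXT_RE = re.compile(r"^/\d+/\d+/")


def is_fulltext_path(path):
    """True if the path looks like a full-text URL."""
    return bool(FULLTEXT_RE.match(path))


def should_log_fulltext(path, referrer):
    """Decide whether a full-text request counts as an access."""
    if not is_fulltext_path(path):
        return False
    if referrer and is_fulltext_path(urlsplit(referrer).path):
        # Requested from within another full-text page: treat as inline content.
        return False
    return True
```

With this rule, an HTML document reached from its abstract page is counted once, while the images it pulls in (whose referrer is the HTML document itself) are not.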

Harvesting accesses from an EPrints 3 repository

Currently Disabled!

A new OAI interface has been added to EPrints 3 that reads records from the access dataset. It supports all the normal OAI verbs but exposes metadata only in OpenURL ContextObject format.

Because accesses are explicitly time-based, a harvester can easily harvest accesses for a specific time period by using the OAI from and until arguments.
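
For illustration, a harvester could build a date-limited request like this. ListRecords, metadataPrefix, from and until are standard OAI-PMH names, but the base URL and the metadata prefix value shown here are placeholders: the actual endpoint and prefix of the access interface are not given in this page, and the interface is currently disabled.

```python
from urllib.parse import urlencode


def list_records_url(base_url, from_date, until_date, metadata_prefix):
    """Build an OAI-PMH ListRecords URL limited to a date range."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,     # inclusive lower bound, e.g. "2006-12-01"
        "until": until_date,   # inclusive upper bound
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and prefix, for illustration only.
url = list_records_url(
    "http://repo.example.org/oai-access",
    "2006-12-01",
    "2006-12-31",
    "context-object",
)
```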

Disk Usage Considerations

The disk usage requirements for storing accesses in the EPrints database are yet to be fully tested.