IRStats

From EPrints Documentation

Revision as of 17:45, 2 August 2007

IRStats is a flexible statistics package which allows easy processing of accesses to fulltext and abstract pages of eprints. For more detailed information, please see the IRStats Technical Documentation.

Technical Overview

The following is a quick tour of IRStats.

Parameters

IRStats output depends on four parameters, which must be passed as CGI parameters when IRStats is called through a web browser, or in a hash when it is called through the Perl API. These are:

Start Date and End Date

Date parameters are implemented as separate day, month and year parameters, so these two parameters are actually six (start_day, start_month, start_year, end_day, end_month, end_year). Any statistics outside this date range are ignored.
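As a rough illustration of how the six date parameters combine into a single range, here is a minimal sketch. IRStats itself is written in Perl; Python is used here purely for illustration. The function names are hypothetical, and treating the range as inclusive of both end points is an assumption.

```python
from datetime import date


def date_range_from_params(params):
    """Build (start, end) dates from the six separate date parameters."""
    start = date(int(params["start_year"]),
                 int(params["start_month"]),
                 int(params["start_day"]))
    end = date(int(params["end_year"]),
               int(params["end_month"]),
               int(params["end_day"]))
    if start > end:
        raise ValueError("start date is after end date")
    return start, end


def in_range(hit_date, start, end):
    """True if a hit falls inside the range (assumed inclusive)."""
    return start <= hit_date <= end


# Parameters as they might arrive from a CGI request (all strings).
params = {
    "start_day": "1", "start_month": "1", "start_year": "2007",
    "end_day": "31", "end_month": "7", "end_year": "2007",
}
start, end = date_range_from_params(params)
```

A hit dated 1 August 2007 would fall outside this range and be ignored.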

An Eprint Set

As well as defining a date range, we also have to tell IRStats which publications we are interested in. Any publication not in the set will be ignored. A set of eprints can either be a single eprint or any set of eprints the system administrator wishes to define in the config files.

View

The final parameter tells IRStats how we want to process and display the statistics. This is done by selecting a View.

Views

Views are Perl modules which plug in to IRStats. They are designed to be user-configurable, though some knowledge of Perl is probably required. When a query is made to IRStats, a View is created. The View generates parameters for the DatabaseInterface object, which queries the database and passes back the results. The View then iterates over the database rows and processes the statistics in whatever way its code defines. These processed results are then passed to a Visualisation.

Visualisations

A Visualisation takes a set of processed statistics and outputs them. For example, Visualisation::Graph::Pie creates a pie chart.

The Database Interface

The Database Interface object handles all queries to the database. Most requests for statistics can be completed with a single call to the get_stats($params) method.
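The flow from View to Database Interface to Visualisation can be sketched as follows. This is an analogy in Python, not the IRStats Perl API: only the name get_stats comes from the text above. The class names, the shape of the parameters and the in-memory rows are assumptions made for illustration.

```python
class DatabaseInterface:
    """Stand-in for the real object: answers queries from an in-memory list."""

    def __init__(self, rows):
        self.rows = rows

    def get_stats(self, params):
        # Return only the rows for eprints in the requested set.
        return [r for r in self.rows if r["eprint"] in params["eprints"]]


class DownloadCountView:
    """Hypothetical View: counts hits per eprint."""

    def __init__(self, db):
        self.db = db

    def process(self, eprints):
        # The View builds the parameters, then iterates over returned rows.
        params = {"eprints": set(eprints)}
        counts = {}
        for row in self.db.get_stats(params):
            counts[row["eprint"]] = counts.get(row["eprint"], 0) + 1
        return counts


def text_table(counts):
    """Hypothetical Visualisation: renders processed stats as plain text."""
    return "\n".join(f"{eprint}\t{n}" for eprint, n in sorted(counts.items()))


db = DatabaseInterface([{"eprint": 1}, {"eprint": 2}, {"eprint": 1}, {"eprint": 3}])
counts = DownloadCountView(db).process([1, 2])
output = text_table(counts)
```

The point of the split is that the same processed counts could be handed to a different Visualisation (a pie chart, say) without changing the View.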

Data Flow Diagram

[Figure: IRStats data flow overview (Irstats overview.png)]

Required Data

In order for IRStats to run, it requires two things:

  • a database table containing all hits to the repository
  • text files describing the contents of the repository

The Hits Table

Awaiting a redevelopment.

The Text Files

In order for IRStats to build up a picture of a repository, a number of text files need to be created and stored in the cfg/ directory:

  • epstats_set_membership.txt
  • epstats_set_member_codes.txt
  • epstats_set_member_full_citations.txt
  • epstats_set_member_short_citations.txt
  • epstats_set_member_urls.txt

Explanation by Example

Imagine a very small repository. Here are its contents:

  • eprints
    • (1) The Smells of Cheese
    • (2) The Tastes of Wines
    • (3) The Sounds of Oboes
  • Authors
    • (1) John Smith
    • (2) Harriet Jones

If we then imagine that the following are also true:

  • John Smith is credited with being an author of eprints (1) and (2)
  • Harriet Jones is credited with being an author of eprints (2) and (3)
  • All three eprints are the output of a research group named "Senses"

Creating Sets

Sets are groups of eprints, and every eprint is a member of at least one set (the set containing only that eprint). From the information above, we have three classes of set: eprints, authors and research groups. We need to add the following to epstats_set_membership.txt (the format is <id><tab><csv list of eprint ids>):

author_1        1,2
author_2        2,3
group_1         1,2,3
eprint_1        1
eprint_2        2
eprint_3        3
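
To make the file format concrete, here is a minimal Python sketch that reads lines in the <id><tab><csv list of eprint ids> format into a dictionary. It is not part of IRStats (which is written in Perl); the function names are hypothetical. Splitting on any whitespace rather than strictly on a tab is an assumption made so that the listing above parses as printed.

```python
def parse_set_membership(text):
    """Parse '<id><tab><csv list of eprint ids>' lines into {id: [eprint ids]}."""
    sets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        set_id, members = line.split(None, 1)
        sets[set_id] = [int(m) for m in members.split(",")]
    return sets


def sets_containing(sets, eprint_id):
    """Return the ids of every set that includes the given eprint."""
    return sorted(s for s, members in sets.items() if eprint_id in members)


sample = (
    "author_1\t1,2\n"
    "author_2\t2,3\n"
    "group_1\t1,2,3\n"
    "eprint_1\t1\n"
    "eprint_2\t2\n"
    "eprint_3\t3\n"
)
membership = parse_set_membership(sample)
```

With this data, eprint 2 belongs to four sets: both authors, the research group, and its own single-eprint set.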

Giving Sets IDs

So, we now have some sets, but we need to give them unique IDs so that we can retrieve stats for these sets. To do this, we add the following to epstats_set_member_codes.txt:

author_1        js
author_2        hj
group_1         senses
eprint_1        1
eprint_2        2
eprint_3        3

IRStats now assigns the following unique IDs to each set: author_js, author_hj, group_senses, eprint_1, eprint_2, eprint_3. Note that the IDs should probably be kept alphanumeric, and must be unique within a class of sets (but you can have author_hj, group_hj and eprint_hj).
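
The mapping described above can be sketched in Python as follows. This is an illustration of the naming rule only, not IRStats code: the function name is hypothetical, and the assumption is that the class of a set is the part of its internal id before the first underscore.

```python
def public_set_ids(codes):
    """Map internal ids like 'author_1' to public ids like 'author_js'."""
    public = {}
    seen = set()
    for internal_id, code in codes.items():
        # Assumed rule: the set class is the text before the first underscore.
        set_class, _, _ = internal_id.partition("_")
        public_id = f"{set_class}_{code}"
        if public_id in seen:
            # Codes must be unique within a class of sets.
            raise ValueError(f"duplicate set ID: {public_id}")
        seen.add(public_id)
        public[internal_id] = public_id
    return public


codes = {
    "author_1": "js", "author_2": "hj", "group_1": "senses",
    "eprint_1": "1", "eprint_2": "2", "eprint_3": "3",
}
ids = public_set_ids(codes)
```

Because the class prefix is part of the ID, the same code can be reused across classes (author_hj and group_hj do not clash), while two authors sharing the code hj would be rejected.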

Citations

IRStats uses two citations for each set member, one short and one long. Which one you use depends on how you would like your visualisation to look, but both need to be added to the citation files:

epstats_set_member_short_citations.txt

author_1         Smith

epstats_set_member_full_citations.txt

author_1         Dr John Smith, PhD

Note that the above examples are only for author_1. The format is exactly the same for any other set member.

URLs

Although URLs are not currently implemented, it is probably a good idea to include this information (in epstats_set_member_urls.txt) for future functionality.

author_1         http://homepage.john.smith.com/

Installing IRStats

Dependencies

Logfile::EPrints

The Logfile::EPrints modules are used to assist in filtering the raw access log. They can be installed from CPAN.

AWStats

AWStats data is used to filter out web spiders and classify search engines. irstats.cfg must contain an entry giving the location of the required Perl modules.

Geo::IP or Geo::IP::PurePerl

Geo::IP is used to fill in country and organisation information. The country database is free, but if you want organisation information, you will have to purchase a subscription to the organisation database.

The pure-Perl version of Geo::IP, Geo::IP::PurePerl, is available from CPAN but does not support organisations.

The location of the database must also be set in irstats.cfg.

Installing

Customising

It will almost always be necessary to customise IRStats to some degree, because every repository is different.

Updating the Table