IRStats
IRStats is a flexible statistics package which allows easy processing of accesses to fulltext and abstract pages of eprints. For more detailed information, please see the IRStats Technical Documentation.
Technical Overview
The following is a quick tour of IRStats.
Parameters
IRStats output depends on four parameters, which are passed as CGI parameters when IRStats is called through a web browser, or in a hash when it is called through the Perl API. These are:
Start Date and End Date
Date parameters are implemented as separate day, month and year parameters, so these two parameters are actually six (start_day, start_month, start_year, end_day, end_month, end_year). Any statistics outside this date range are ignored.
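As a rough illustration of how the six date parameters describe a window, the following Python sketch (not IRStats code, which is written in Perl) assembles them into two dates and tests whether a hit falls inside. The helper name in_date_range and the choice to treat both endpoints as inclusive are assumptions made for the example.

```python
from datetime import date

def in_date_range(hit_date, params):
    """Return True if hit_date lies within the window described by params.

    params uses the six parameter names from the text. Treating both
    endpoints as inclusive is an assumption of this sketch.
    """
    start = date(params["start_year"], params["start_month"], params["start_day"])
    end = date(params["end_year"], params["end_month"], params["end_day"])
    return start <= hit_date <= end

params = {
    "start_day": 1, "start_month": 1, "start_year": 2007,
    "end_day": 31, "end_month": 3, "end_year": 2007,
}

inside = in_date_range(date(2007, 2, 14), params)   # within the window
outside = in_date_range(date(2007, 4, 1), params)   # after the end date
```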
An Eprint Set
As well as defining a date range, we also have to tell IRStats which publications we are interested in. Any publication not in the set is ignored. A set of eprints can be either a single eprint or any set of eprints the system administrator defines in the config files.
View
The final parameter tells IRStats how we want to process and display the statistics. This is done by selecting a View.
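To make the four parameters concrete, here is a minimal Python sketch that assembles them into a CGI query string. The six date keys come from the text above; the keys "set" and "view", the set ID "author_js" used as a value, and the view name "ExampleView" are hypothetical placeholders, since the actual parameter names are not given here.

```python
from urllib.parse import urlencode

def build_query(start, end, eprint_set, view):
    """Build a query string from (day, month, year) tuples, a set ID and a view name."""
    params = {
        "start_day": start[0], "start_month": start[1], "start_year": start[2],
        "end_day": end[0], "end_month": end[1], "end_year": end[2],
        # The two keys below are hypothetical names chosen for this sketch.
        "set": eprint_set,
        "view": view,
    }
    return urlencode(params)

query = build_query((1, 1, 2007), (31, 12, 2007), "author_js", "ExampleView")
print(query)
```

When calling through the Perl API, the same key/value pairs would be passed in a hash rather than encoded into a URL.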
Views
Views are Perl modules which plug in to IRStats. They have been designed to be user configurable, though some knowledge of Perl is probably required. When a query is made to IRStats, a View is created. It generates some parameters for the DatabaseInterface object, which queries the database and passes back the results of the query. The View then iterates over the database rows and processes the stats in any way programmatically possible. These processed results are then passed to a Visualisation.
Visualisations
A Visualisation takes a set of processed statistics and outputs them. For example, Visualisation::Graph::Pie creates a pie chart.
The Database Interface
The Database Interface object handles all queries to the database. Most requests for statistics can be completed with a single call to the get_stats($params) method.
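The flow described above (a View asks the Database Interface for rows, processes them, and hands the result to a Visualisation) can be sketched in a few lines of Python. This is only a conceptual model: the class names, the "eprint_ids" parameter and the row layout are invented for illustration and do not reflect the real Perl modules.

```python
from collections import Counter

class FakeDatabaseInterface:
    """Stand-in for the Database Interface, backed by an in-memory list."""
    def __init__(self, rows):
        self.rows = rows

    def get_stats(self, params):
        # A real implementation would query the hits table; this one
        # simply filters the list by eprint ID.
        wanted = set(params["eprint_ids"])
        return [row for row in self.rows if row["eprint"] in wanted]

class DownloadsPerEprintView:
    """Hypothetical View: counts hits per eprint."""
    def __init__(self, database):
        self.database = database

    def process(self, params):
        rows = self.database.get_stats(params)
        return Counter(row["eprint"] for row in rows)

class TextTableVisualisation:
    """Hypothetical Visualisation: renders counts as plain text lines."""
    def render(self, counts):
        return "\n".join(f"eprint {eprint}: {n}" for eprint, n in sorted(counts.items()))

rows = [{"eprint": 1}, {"eprint": 2}, {"eprint": 2}, {"eprint": 3}]
view = DownloadsPerEprintView(FakeDatabaseInterface(rows))
counts = view.process({"eprint_ids": [1, 2]})
output = TextTableVisualisation().render(counts)
print(output)
```

Keeping the three roles separate means a new chart type only needs a new Visualisation, and a new way of slicing the data only needs a new View.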
Data Flow Diagram
Required Data
In order for IRStats to run, it requires two things:
- a database table containing all hits to the repository
- text files describing the contents of the repository
The Hits Table
This section is awaiting redevelopment.
The Text Files
In order for IRStats to build up a picture of a repository, a number of text files need to be created and stored in the cfg/ directory:
- epstats_set_membership.txt
- epstats_set_member_codes.txt
- epstats_set_member_full_citations.txt
- epstats_set_member_short_citations.txt
- epstats_set_member_urls.txt
Explanation by Example
Imagine a very small repository. Here are its contents:
- eprints
- (1) The Smells of Cheese
- (2) The Tastes of Wines
- (3) The Sounds of Oboes
- Authors
- (1) John Smith
- (2) Harriet Jones
If we then imagine that the following are also true:
- John Smith is credited with being an author of eprints (1) and (2)
- Harriet Jones is credited with being an author of eprints (2) and (3)
- All three eprints are the output of a research group named "Senses"
Creating sets
Sets are groups of eprints, and every eprint is a member of at least one set (the set containing only that eprint). From the information above, we have three classes of set: eprints, authors and research groups. We need to add the following to epstats_set_membership.txt (the format is <id><tab><csv list of eprint ids>):
author_1	1,2
author_2	2,3
group_1	1,2,3
eprint_1	1
eprint_2	2
eprint_3	3
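For readers who want to check a membership file by hand, the following Python sketch parses the <id><tab><csv list of eprint ids> format into a dictionary. It assumes one set per line; the function name parse_set_membership is made up for this example and is not part of IRStats.

```python
def parse_set_membership(text):
    """Parse '<id><tab><csv list of eprint ids>' lines into {set_id: [eprint ids]}."""
    membership = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        set_id, csv_ids = line.split("\t", 1)
        membership[set_id] = [int(x) for x in csv_ids.split(",")]
    return membership

# The example repository from the text, one set per line.
sample = (
    "author_1\t1,2\n"
    "author_2\t2,3\n"
    "group_1\t1,2,3\n"
    "eprint_1\t1\n"
    "eprint_2\t2\n"
    "eprint_3\t3\n"
)

membership = parse_set_membership(sample)
print(membership["group_1"])
```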
Giving Sets IDs
So, we now have some sets, but we need to give them unique IDs so that we can retrieve stats for these sets. To do this, we add the following to epstats_set_member_codes.txt:
author_1	js
author_2	hj
group_1	senses
eprint_1	1
eprint_2	2
eprint_3	3
IRStats now assigns the following unique IDs to each set: author_js, author_hj, group_senses, eprint_1, eprint_2, eprint_3. Note that the IDs should probably be kept alphanumeric, and must be unique within a class of sets (but you can have author_hj, group_hj and eprint_hj).
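The way a class prefix and a code combine into a unique ID can be modelled as follows. This Python sketch is an illustration of the naming rule only, not the IRStats implementation; it assumes the class is the part of the internal name before the first underscore, and the function name public_ids is hypothetical.

```python
def public_ids(codes):
    """Map internal set names (e.g. 'author_1') to IDs such as 'author_js'.

    codes maps an internal name to its code. The class is taken to be
    the text before the first underscore (an assumption of this sketch).
    """
    result = {}
    for internal, code in codes.items():
        set_class = internal.split("_", 1)[0]
        public = f"{set_class}_{code}"
        if public in result.values():
            # Codes must be unique within a class of sets.
            raise ValueError(f"duplicate ID {public}")
        result[internal] = public
    return result

codes = {
    "author_1": "js", "author_2": "hj", "group_1": "senses",
    "eprint_1": "1", "eprint_2": "2", "eprint_3": "3",
}
ids = public_ids(codes)
print(sorted(ids.values()))
```

Because the class is part of the ID, the same code can be reused across classes (author_hj and group_hj do not clash), while a repeated code within one class is rejected.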
Citations
IRStats uses two citations for each set member, one short and one long. Which one is used depends on how you would like your visualisation to look. Either way, both need to be added to the citation files:
epstats_set_member_short_citations.txt
author_1 Smith
epstats_set_member_full_citations.txt
author_1 Dr John Smith, PhD
Note that the above examples are only for author_1. It would be exactly the same for any set member.
URLs
Although URLs are not currently implemented, it is probably a good idea to include this information (in epstats_set_member_urls.txt) for future functionality.
author_1 http://homepage.john.smith.com/
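Since the five text files all describe the same set members, a quick cross-check can catch entries that were forgotten. The Python sketch below reports set names that appear in the membership data but are missing from another file. Whether IRStats strictly requires every member to appear in every file is an assumption here, and the function name missing_entries is invented for the example.

```python
def missing_entries(membership, **other_files):
    """Return {file_label: sorted set names in membership but absent from that file}."""
    report = {}
    for label, entries in other_files.items():
        missing = sorted(set(membership) - set(entries))
        if missing:
            report[label] = missing
    return report

# Already-parsed file contents, keyed by internal set name.
membership = {"author_1": [1, 2], "author_2": [2, 3]}
report = missing_entries(
    membership,
    codes={"author_1": "js", "author_2": "hj"},
    short_citations={"author_1": "Smith"},
    full_citations={"author_1": "Dr John Smith, PhD"},
    urls={"author_1": "http://homepage.john.smith.com/"},
)
print(report)
```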
Installing IRStats
To run IRStats there are two separate processes that need to be completed:
- Creating the Log Files in the required format
- Running IRStats
Creating the Log Files
To create the log file it is recommended that you have the following installed:
Dependencies
Logfile::EPrints
The Logfile::EPrints modules are used to assist in filtering the raw access log. They can be installed from CPAN.
AWStats
AWStats data is used to filter out web spiders and classify search engines. AWStats is a separate log analysis program and can be obtained from http://awstats.sourceforge.net/
Once AWStats is installed it is necessary to edit irstats.cfg to enter the correct path to its Perl modules. The default path is /usr/local/awstats/wwwroot/cgi-bin/lib/search_engines.pm
Geo::IP or Geo::IP::PurePerl
Geo::IP is used to fill in country and organisation information. The country database is free, but if you want organisation information, you will have to purchase a subscription for their database.
The pure-Perl version of Geo::IP, Geo::IP::PurePerl, is available from CPAN but does not support organisations.
MySQL
The information about the log files is stored in a database, so it is necessary to have a MySQL client and server (or equivalent) running.
If you are importing the data from elsewhere rather than generating it yourself, the command to import the dump file is:
mysql -uroot --database=<database name> < irstats_true_accesses_table.dump
Information about the database configuration needs to be set in the irstats.cfg file.
As well as the database tables it is necessary to create a user and password which the script can use to access the data and give that user the necessary permissions. The SQL is:
grant all privileges on <database name>.* to <user name>@localhost identified by '<user password>';
Creating the Graphs
Once the log files are created, IRStats has the following dependencies:
Dependencies
Date::Calc
Date::Calc is used to control the periods for which information is returned. The module can be downloaded from CPAN.
Installing
Once all the required programs and modules have been installed then IRStats can be installed and run.
The IRStats files should be copied (and untarred if necessary) into the /opt/ directory.
If IRStats is put elsewhere then the paths to the relevant files need to be set in the irstats.cfg file. It is worth checking irstats.cfg anyway to confirm that all the paths are correct for your setup.
Folder Permissions
Folders requiring Read and Execute Permissions
- /irstats/cfg
- /irstats/cgi
- /irstats/perl_lib
Folders requiring Read and Write Permissions
- /irstats/cgi/view_thumbs
- /irstats/cache
- /irstats/img
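A small script can confirm that the folders listed above exist and carry the expected permissions. The following Python sketch is one way to do that. It assumes the folders live under an install root such as /opt, and it checks access for whichever user runs the script, which may differ from the web server's user. The function name check_permissions is hypothetical.

```python
import os

READ_EXECUTE = ["irstats/cfg", "irstats/cgi", "irstats/perl_lib"]
READ_WRITE = ["irstats/cgi/view_thumbs", "irstats/cache", "irstats/img"]

def check_permissions(root):
    """Return a list of (path, problem) tuples for the IRStats folders under root."""
    wanted = (
        [(rel, os.R_OK | os.X_OK, "read/execute") for rel in READ_EXECUTE]
        + [(rel, os.R_OK | os.W_OK, "read/write") for rel in READ_WRITE]
    )
    problems = []
    for rel, mode, label in wanted:
        path = os.path.join(root, rel)
        if not os.path.isdir(path):
            problems.append((path, "missing"))
        elif not os.access(path, mode):  # tested for the current user only
            problems.append((path, f"needs {label} permission"))
    return problems

# Example: report problems for an installation under /opt.
for path, problem in check_permissions("/opt"):
    print(f"{path}: {problem}")
```

To test the permissions as seen by Apache, run the script as the user the web server runs under.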
Running the Perl Script to Populate the Database
In irstats.cfg, edit the paths of the files used to store set information so that they are correct. The default place for these files is /opt/irstats/data/, so if the path is set to /opt/irstats/cfg/ it needs to be changed.
Due to a small bug it is necessary to open the irstats/bin/import_metadata.pl script and comment out the following lines before it is run (the lines can be commented out by adding a # at the start of each line):
$database->do_sql("DROP TABLE $table");
$database->do_sql("DROP TABLE $citation_table");
$database->do_sql("DROP TABLE $code_table");
Having commented out these lines, run the Perl script. This will populate the database with the necessary author, paper and group tables.
Once the script has run successfully, uncomment the three lines again.
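If this workaround has to be repeated, the commenting and uncommenting can be automated. The Python sketch below toggles a leading # on lines of the form quoted above. It works on the script text as a string and assumes each DROP TABLE call sits on its own line. The function name set_drop_lines_commented is invented for this example, so back up import_metadata.pl before writing any result to disk.

```python
import re

# Matches the DROP TABLE calls quoted in the text, with or without a leading '#'.
DROP_LINE = re.compile(r'^(\s*)(#\s*)?(\$database->do_sql\("DROP TABLE \$\w+"\);)\s*$')

def set_drop_lines_commented(script_text, commented):
    """Return script_text with every DROP TABLE line commented or uncommented."""
    out = []
    for line in script_text.splitlines():
        match = DROP_LINE.match(line)
        if match:
            indent, _, statement = match.groups()
            line = f"{indent}{'#' if commented else ''}{statement}"
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if script_text.endswith("\n") else "")

script = (
    'my $x = 1;\n'
    '$database->do_sql("DROP TABLE $table");\n'
    '$database->do_sql("DROP TABLE $citation_table");\n'
    '$database->do_sql("DROP TABLE $code_table");\n'
)
disabled = set_drop_lines_commented(script, True)
restored = set_drop_lines_commented(disabled, False)
print(disabled)
```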
Configuring Apache
In the apache2 configuration file it is necessary to add the following lines:
ScriptAlias /stats/ /opt/irstats/cgi/
Alias /img/ /opt/irstats/img/
Customising
It will almost always be necessary to perform some customisation on IRStats because every repository is different.