IRStats

From EPrints Documentation
Revision as of 16:02, 8 February 2010 by Pm705 (talk | contribs)

IRStats is a flexible statistics package for processing accesses to the full-text and abstract pages of eprints. For more detailed information, please see the IRStats Technical Documentation.

Technical Overview

The following is a quick tour of IRStats.

Parameters

IRStats output depends on four parameters, which are passed as CGI parameters when IRStats is called through a web browser, or in a hash when it is called through the Perl API. These are:

Start Date and End Date

Date parameters are implemented as separate day, month and year parameters, so these two parameters are actually six (start_day, start_month, start_year, end_day, end_month, end_year). Any statistics outside this date range are ignored.

An Eprint Set

As well as defining a date range, we also have to tell IRStats which publications we are interested in. Any publication not in the set will be ignored. A set of eprints can either be a single eprint or any set of eprints the system administrator wishes to define in the config files.

View

The final parameter tells IRStats how we want to process and display the statistics. This is done by selecting a View.
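
As a rough illustration of the four parameters when they are supplied as a hash, the sketch below builds such a hash and checks whether a date falls inside the requested range. Only the six date keys are named above; the key names eprints and view, their values, and the helpers date_key and in_date_range are assumptions made for this example and are not taken from the IRStats API.

```perl
use strict;
use warnings;

# The six date keys come from the text above. 'eprints' and 'view' are
# hypothetical key names standing in for the eprint set and the View.
my %params = (
    start_day => 1,  start_month => 1,  start_year => 2009,
    end_day   => 31, end_month   => 12, end_year   => 2009,
    eprints   => 'eprint_1',
    view      => 'ExampleView',
);

# Turn a (year, month, day) triple into a sortable integer such as 20090101.
sub date_key {
    my ($year, $month, $day) = @_;
    return sprintf('%04d%02d%02d', $year, $month, $day) + 0;
}

# Hypothetical helper: returns 1 if the given date lies inside the date
# range described by the parameter hash, 0 otherwise.
sub in_date_range {
    my ($p, $year, $month, $day) = @_;
    my $start = date_key(@{$p}{qw(start_year start_month start_day)});
    my $end   = date_key(@{$p}{qw(end_year end_month end_day)});
    my $date  = date_key($year, $month, $day);
    return ($date >= $start && $date <= $end) ? 1 : 0;
}

print in_date_range(\%params, 2009, 6, 15) ? "counted\n" : "ignored\n";
print in_date_range(\%params, 2010, 2, 8)  ? "counted\n" : "ignored\n";
```

Comparing dates as zero-padded integers keeps the example free of non-core modules; IRStats itself relies on Date::Calc for its date handling.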

Views

Views are Perl modules which plug into IRStats. They have been designed to be user configurable, though some knowledge of Perl is probably required. When a query is made to IRStats, a View is created. It generates some parameters for the DatabaseInterface object, which queries the database and passes back the results of the query. The View then iterates over the database rows and processes the stats in any way programmatically possible. These processed results are then passed to a Visualisation.

Visualisations

A Visualisation takes a set of processed statistics and outputs them. For example, Visualisation::Graph::Pie creates a pie chart.

The Database Interface

The Database Interface object handles all queries to the database. Most requests for statistics can be completed with a single call to the get_stats($params) method.

Data Flow Diagram

[Figure: IRStats data flow overview (Irstats overview.png)]

Required Data

In order for IRStats to run, it requires two things:

  • a database table containing all hits to the repository
  • text files describing the contents of the repository

The Hits Table

Awaiting a redevelopment.

The Text Files

In order for IRStats to build up a picture of a repository, a number of text files need to be created and stored in the cfg/ directory:

  • epstats_set_membership.txt
  • epstats_set_member_codes.txt
  • epstats_set_member_full_citations.txt
  • epstats_set_member_short_citations.txt
  • epstats_set_member_urls.txt

Explanation by Example

Imagine a very small repository. Here are its contents:

  • eprints
    • (1) The Smells of Cheese
    • (2) The Tastes of Wines
    • (3) The Sounds of Oboes
  • Authors
    • (1) John Smith
    • (2) Harriet Jones

If we then imagine that the following are also true:

  • John Smith is credited with being an author of eprints (1) and (2)
  • Harriet Jones is credited with being an author of eprints (2) and (3)
  • All three eprints are the output of a research group named "Senses"

Creating sets

Sets are groups of eprints, and every eprint is a member of at least one set (the set containing only that eprint). From the information above, we have three classes of set: eprint sets, author sets and research group sets. We need to add the following to epstats_set_membership.txt (the format is <id><tab><csv list of eprint ids>):

author_1        1,2
author_2        2,3
group_1         1,2,3
eprint_1        1
eprint_2        2
eprint_3        3
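
To make the file format concrete, here is a minimal sketch of how such lines could be read into a Perl hash. It assumes the separator is a single tab, as the stated format says; the function name parse_set_membership and the sample data are illustrative and are not part of IRStats.

```perl
use strict;
use warnings;

# Parse lines of the form "<id><tab><csv list of eprint ids>" into a hash
# that maps each set member id to an array reference of eprint ids.
sub parse_set_membership {
    my ($text) = @_;
    my %members;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;              # skip blank lines
        my ($id, $csv) = split /\t/, $line, 2;
        next unless defined $csv;              # skip lines without a tab
        $members{$id} = [ split /\s*,\s*/, $csv ];
    }
    return \%members;
}

# Sample data in the same shape as the listing above.
my $sample = join "\n",
    "author_1\t1,2",
    "author_2\t2,3",
    "group_1\t1,2,3",
    "eprint_1\t1";

my $sets = parse_set_membership($sample);
for my $id (sort keys %$sets) {
    print "$id contains eprints: @{ $sets->{$id} }\n";
}
```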

Giving Sets IDs

So, we now have some sets, but we need to give them unique IDs so that we can retrieve stats for these sets. To do this, we add the following to epstats_set_member_codes.txt:

author_1        js
author_2        hj
group_1         senses
eprint_1        1
eprint_2        2
eprint_3        3

IRStats now assigns the following unique IDs to each set: author_js, author_hj, group_senses, eprint_1, eprint_2, eprint_3. Note that the IDs should probably be kept alphanumeric, and must be unique within a class of sets (but you can have author_hj, group_hj and eprint_hj).
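
The mapping from the entries above to the final IDs can be sketched as follows. This is only an illustration of the naming rule described in the text; the helper set_id and the duplicate check are assumptions, not code from IRStats.

```perl
use strict;
use warnings;

# Codes as they would appear in epstats_set_member_codes.txt.
my %codes = (
    author_1 => 'js',
    author_2 => 'hj',
    group_1  => 'senses',
    eprint_1 => '1',
);

# Hypothetical helper: keep the class prefix ("author", "group", ...) of a
# set member id and replace the trailing part with that member's code.
sub set_id {
    my ($member, $code) = @_;
    my ($class) = $member =~ /^(.+)_[^_]+$/
        or die "unexpected member id: $member";
    return "${class}_$code";
}

my %seen;
for my $member (sort keys %codes) {
    my $id = set_id($member, $codes{$member});
    # The class is part of the id, so a repeated id means a code was
    # reused within one class of sets.
    die "duplicate set id: $id" if $seen{$id}++;
    print "$member -> $id\n";
}
```

Because the class name is part of the ID, the same code can safely appear in different classes, which is why author_hj and group_hj can coexist.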

Citations

IRStats uses two citations for each set member, one short and one full. Which you use depends on how you would like your visualisation to look. Both need to be added to the citation files:

epstats_set_member_short_citations.txt

author_1         Smith

epstats_set_member_full_citations.txt

author_1         Dr John Smith, PhD

Note that the above examples are only for author_1. It would be exactly the same for any set member.

URLs

Although URLs are not currently implemented, it is probably a good idea to include this information (in epstats_set_member_urls.txt) for future functionality.

author_1         http://homepage.john.smith.com/

Installing IRStats

To run IRStats there are two separate processes that need to be completed:

  • Creating the Log Files in the required format
  • Running IRStats

Creating the Log Files

To create the log file it is recommended that you have the following installed:

Dependencies

Logfile::EPrints

The Logfile::EPrints modules are used to assist in filtering the raw access log. They can be installed from CPAN.

AWStats

AWStats data is used to filter out web spiders and classify search engines. AWStats is a separate log analysis program and can be obtained from http://awstats.sourceforge.net/

Once AWStats is installed, it is necessary to edit irstats.cfg to enter the correct path to its Perl module. The default path is /usr/local/awstats/wwwroot/cgi-bin/lib/search_engines.pm

Geo::IP or Geo::IP::PurePerl

Geo::IP is used to fill in country and organisation information. The country database is free, but if you want organisation information, you will have to purchase a subscription for their database.

The pure-Perl version of Geo::IP, Geo::IP::PurePerl, is available from CPAN but does not support organisations.

MySQL

The information about the log files is stored in a database, so it is necessary to have a MySQL client and server running (or equivalent).

If you are importing the data from elsewhere rather than generating it yourself, then the command to import a dump file is:

mysql -uroot --database=[database name] < [table name].dump

To create the standard graphs you need dump files for at least the following tables:

  • irstats_true_accesses_table
  • irstats_column_referrer_scope
  • irstats_column_referring_entity_id
  • irstats_column_requester_host
  • irstats_column_requester_organisation
  • irstats_column_search_engine
  • irstats_column_search_terms

Information about the database configuration needs to be set in the irstats.cfg file.

As well as creating the database tables, it is necessary to create a user and password that the scripts can use to access the data, and to give that user the necessary permissions. The SQL is:

grant all privileges on [database name].* to [user name]@localhost identified by '[user password]';

Creating the Graphs

Once the log files have been created, IRStats itself has the following dependency:

Dependencies

Date::Calc

Date::Calc is used to control the periods that information is returned for. The module can be downloaded from CPAN.

Installing

Once all the required programs and modules have been installed, IRStats can be installed and run.

The IRStats files should be copied, untarred if necessary, into the /opt/ directory.

If IRStats is put elsewhere, then the paths to the relevant files need to be set in the irstats.cfg file. It is worth checking irstats.cfg anyway to confirm that all the paths are correct for your setup.

Folder Permissions

Folders requiring Read and Execute Permissions:

  • /irstats/cfg
  • /irstats/cgi
  • /irstats/perl_lib

Folders requiring Read and Write Permissions:

  • /irstats/cgi/view_thumbs
  • /irstats/cache
  • /irstats/img

Running the Perl Script to Populate the Database

In irstats.cfg, edit the paths of the files used to store set information so that they are correct. The default location for these files is /opt/irstats/data/, so if the path is set to /opt/irstats/cfg/ it needs to be changed.

Due to a small bug it is necessary to open the irstats/bin/import_metadata.pl script and comment out the following lines before it is run (the lines can be commented out by adding a # at the start of each line):

$database->do_sql("DROP TABLE $table");

$database->do_sql("DROP TABLE $citation_table");

$database->do_sql("DROP TABLE $code_table");


Having commented out these lines, run the Perl script. This will populate the database with the necessary author, paper and group tables.

Once the script has run successfully, uncomment the three lines again.

Configuring Apache

In the Apache 2 configuration file it is necessary to add the following lines:

Alias /stats/view_thumbs /opt/irstats/cgi/view_thumbs

ScriptAlias /stats/ /opt/irstats/cgi/

Alias /img/ /opt/irstats/img/


(Don't forget to restart Apache after you have made the changes to the config file.)

Customising

It will almost always be necessary to perform some customisation on IRStats because every repository is different.

Updating the Table

The tables are updated by running the update_tables.pl script which is located in the /data/ folder. This script needs to be run whenever the tables need to be changed. For most systems it is recommended that the script is automatically run at a given interval, for example once a night.

Creating New Views

A view has three subs:

  • initialise
  • new
  • populate

The basic view program looks like this:

package IRStats::View::<View Name Here>;

use strict;
use warnings;

use IRStats::DatabaseInterface;
use IRStats::Cache;
use IRStats::Visualisation::<Visualisation Module Here>;
use IRStats::View;
use Data::Dumper;


our @ISA = qw/ IRStats::View /;

sub initialise
{
	<Initialisation Code Here>
}

sub new
{
	<New Code Here>
}

sub populate
{
	<Population Code Here>
}

1;

Considering each of the subs:

initialise

my ($self) = @_;

Define the SQL parameters:

$self->{'sql_params'} = {
		<parameters go here in comma-separated list>
	};

The parameters are:

  • columns (which columns to return. May include 'COUNT')
  • where (any conditionals, divided into column, operator and value. Multiple conditionals can be added as a comma-separated list with each set of conditional statements surrounded by curly brackets)
  • group (what the returned information should be grouped by)
  • order (the order that the returned information should be in. Divided into column and direction.)
  • limit (limit on the number of answers)

It is only necessary to include the parameters that you need to set.

For example:

	$self->{'sql_params'} = {
		columns => [ 'eprint', 'COUNT' ],
		group => "eprint",
		order => {column => "COUNT", direction => "DESC"},
		limit => 10
	};
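
The SQL that the DatabaseInterface generates from these parameters is not shown on this page. Purely to illustrate how the five keys relate to the clauses of a SELECT statement, here is a hypothetical translation. The function build_query, the mapping of 'COUNT' to COUNT(*), the shape of the where entries (a list of hashes with column, operator and value keys), the referrer_scope condition, and the choice of irstats_true_accesses_table as the table are all assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical translation of an sql_params hash into an SQL string plus
# bind values. This is NOT the DatabaseInterface code, only an illustration.
sub build_query {
    my ($table, $p) = @_;
    my @bind;

    # Assumed: the pseudo-column 'COUNT' stands for COUNT(*).
    my @columns = map { $_ eq 'COUNT' ? 'COUNT(*)' : $_ } @{ $p->{columns} || ['*'] };
    my $sql = 'SELECT ' . join(', ', @columns) . " FROM $table";

    # Each conditional is a hash with column, operator and value.
    if (my @where = @{ $p->{where} || [] }) {
        $sql .= ' WHERE ' . join ' AND ',
            map { push @bind, $_->{value}; "$_->{column} $_->{operator} ?" } @where;
    }

    $sql .= " GROUP BY $p->{group}" if defined $p->{group};

    if (my $order = $p->{order}) {
        my $column = $order->{column} eq 'COUNT' ? 'COUNT(*)' : $order->{column};
        $sql .= " ORDER BY $column $order->{direction}";
    }

    $sql .= ' LIMIT ' . int($p->{limit}) if defined $p->{limit};

    return ($sql, @bind);
}

my ($sql, @bind) = build_query('irstats_true_accesses_table', {
    columns => [ 'eprint', 'COUNT' ],
    where   => [ { column => 'referrer_scope', operator => '=', value => 'external' } ],
    group   => 'eprint',
    order   => { column => 'COUNT', direction => 'DESC' },
    limit   => 10,
});

print "$sql\n";
print "bind values: @bind\n";
```

Passing the conditional values as bind values rather than interpolating them keeps the sketch safe against quoting problems; whether IRStats does the same is not stated here.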

Having defined the SQL parameters, it is necessary to set up the visualisation constructor:

        $self->{'visualisation'} = <Graph Module here>->new(
	{
		<Parameters go here>
	}
        );

The visualisation constructors are:

  • IRStats::Visualisation::Graph::Bar
  • IRStats::Visualisation::Graph::Line
  • IRStats::Visualisation::Graph::Pie
  • IRStats::Visualisation::Table::CSV
  • IRStats::Visualisation::Table::HTML
  • IRStats::Visualisation::Table::HTML_Columned
  • IRStats::Visualisation::HTML

Different constructors have different parameters:


Graphs:

IRStats::Visualisation::Graph::Bar and IRStats::Visualisation::Graph::Line

filename => $self->{'params'}->get('id') . ".png",
title => "<Your Title Here>",
x_title => "<Your X-Axis Title Here>",
y_title => "<Your Y-Axis Title Here>",
data_series => [],
x_labels => [],
params => $self->{params}

IRStats::Visualisation::Graph::Pie

filename => $self->{'params'}->get('id') . ".png",
title => "<Your Title Here>",
data_series => [],
params => $self->{params}

IRStats::Visualisation::Table::CSV and IRStats::Visualisation::Table::HTML

columns => [<Comma-Separated List of Column Headers Here>],
rows => []

IRStats::Visualisation::Table::HTML_Columned

title => "<Your Title Here>",
columns => [<Comma-Separated List of Column Headers Here>],
rows => []

IRStats::Visualisation::HTML

html => '<Any Default HTML Goes Here>'


Having created the constructor, you may wish to create a number of global parameters to store information such as the maximum number of rows. In that case, add the following line *after* the constructor:

$self->{<Your Parameter>} = <Your Value Here>; 


So the whole thing should look like:

sub initialise
{
        my ($self) = @_;
	$self->{'sql_params'} = {
		<Your Parameters Here>
	};
        $self->{'visualisation'} = <Your Visualisation Type Here>->new(
	{
		<Your Parameters Here>
	}
        );
	<Any Additional Parameters Here>
}

New

sub new
{
        my( $class, $params, $database ) = @_;
        my $self = $class->SUPER::new($params, $database);
        $self->initialise();
        return $self;
}

Populate

Populate is the complicated section where the main programming takes place.

It almost always starts with the following declarations:

	my ($self) = @_;

##Check Cache
	my $cache = IRStats::Cache->new($self->{'params'});
	if ($cache->exists)
	{
		$self->{'visualisation'} = $cache->read();
		return;
	}

and ends:


	$self->{'visualisation'}->set('x_labels', $x_labels);
	$self->{'visualisation'}->set('data_series', $data_series);

	##write to cache
	$cache->write($self->{'visualisation'});

although which values are set on $self->{'visualisation'} depends on the visualisation in use. The following is a general guide:

Graphs (Not Pie):

$self->{'visualisation'}->set('x_labels', $x_labels);
$self->{'visualisation'}->set('data_series', $data_series);

Pie Graphs:

$self->{'visualisation'}->set('data_series', $data_series);

Plain HTML:

$self->{'visualisation'}->set('html',$html);

Tables and CSV:

$self->{'visualisation'}->set('rows',$rows);


The sub should also contain a call to the database to carry out the previously defined query:

       <define variables>

	my $query = $self->{'database'}->get_stats(
			$self->{'params'},
			$self->{'sql_params'}
			);

	while ( my @row = $query->fetchrow_array() )
	{
		<assign the results to the relevant variables>
	}
	$query->finish(); 

As well as the above, the populate sub contains the code to analyse, alter and manipulate the data retrieved from the database before publishing it as a graph or table.

The most basic function resembles the following:


	my ($self) = @_;
##Check Cache
	my $cache = IRStats::Cache->new($self->{'params'});
	if ($cache->exists)
	{
		$self->{'visualisation'} = $cache->read();
		return;
	}

	<create variables e.g. my $rows = [];>

	my $query = $self->{'database'}->get_stats(
			$self->{'params'},
			$self->{'sql_params'}
			);

	while ( my @row = $query->fetchrow_array() )
	{
		<process and store data e.g. push @{$rows}, \@row;>
	}
	$query->finish(); 

	<send to visualisation e.g. $self->{'visualisation'}->set('rows',$rows);>

	##write to cache
	$cache->write($self->{'visualisation'});
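
To see the flow of populate end to end without an IRStats installation, the sketch below runs the same steps against small stand-in classes. Everything in the Mock:: packages, including the sample rows, is invented for this example; the real IRStats::Cache, database and visualisation objects will behave differently in detail. Unlike the real code, which creates the cache inside populate with IRStats::Cache->new, the stand-in cache is passed in through the view hash so that it survives between calls.

```perl
use strict;
use warnings;

# --- Stand-ins (hypothetical) for the cache, database and visualisation ---
package Mock::Cache;
sub new    { return bless { stored => undef }, shift }
sub exists { return defined $_[0]{stored} }
sub read   { return $_[0]{stored} }
sub write  { $_[0]{stored} = $_[1]; return }

package Mock::Query;
sub new            { my ($class, @rows) = @_; return bless { rows => [@rows] }, $class }
sub fetchrow_array { my $row = shift @{ $_[0]{rows} }; return $row ? @$row : () }
sub finish         { return 1 }

package Mock::Database;
sub new       { return bless {}, shift }
# Returns made-up (eprint id, hit count) rows regardless of the parameters.
sub get_stats { return Mock::Query->new([ 2, 40 ], [ 1, 25 ], [ 3, 7 ]) }

package Mock::Table;
sub new { return bless {}, shift }
sub set { my ($self, $key, $value) = @_; $self->{$key} = $value; return }

package main;

# Same steps as the populate template above: check the cache, run the
# query, collect the rows, hand them to the visualisation, write the cache.
sub populate {
    my ($self) = @_;

    my $cache = $self->{cache};
    if ($cache->exists) {
        $self->{visualisation} = $cache->read();
        return;
    }

    my $rows  = [];
    my $query = $self->{database}->get_stats($self->{params}, $self->{sql_params});
    while (my @row = $query->fetchrow_array()) {
        push @{$rows}, [@row];    # copy each row before storing it
    }
    $query->finish();

    $self->{visualisation}->set('rows', $rows);
    $cache->write($self->{visualisation});
    return;
}

my $view = {
    params        => {},
    sql_params    => { columns => [ 'eprint', 'COUNT' ], group => 'eprint' },
    database      => Mock::Database->new,
    cache         => Mock::Cache->new,
    visualisation => Mock::Table->new,
};

populate($view);
print join(', ', map { "eprint $_->[0]: $_->[1] hits" } @{ $view->{visualisation}{rows} }), "\n";
```

Calling populate a second time on the same view takes the early return, because the stand-in cache now holds the visualisation written by the first call.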