Difference between revisions of "IRStats"

From EPrints Documentation
(Removed out-of-date documentation)
Line 1: Line 1:
 
[[Category:IRStats]]
 
[[Category:IRStats]]
IRStats is a flexible statistics package which allows easy processing of accesses to fulltext and abstract pages of eprints.  For more detailed information, please see the [[IRStats Technical Documentation]].
+
IRStats is a flexible statistics package which allows easy processing of accesses to fulltext documents of eprints.  For more detailed information, please see the [[IRStats Technical Documentation]], though it is now somewhat out of date.
  
== Technical Overview ==
+
== The front end ==
  
The following is a quick tour of IRStats.
 
  
=== Parameters ===
+
== The configuration file ==
  
IRStats output depends on four parameters, which need to be passed as cgi parameters if called through a web browser, or in a hash if called through the Perl API. These are:
+
Documentation to follow.
 
+
==== Start Date and End Date ====
+
 
+
Date parameters are implemented as separate day, month and year parameters, so these two parameters are actually six (start_day, start_month, start_year, end_day, end_month, end_year).  Any statistics outside this daterange are ignored.
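As a rough illustration of how the six date parameters define a range, the following sketch builds them as a Perl hash and tests whether a given date falls inside it. The six key names come from the text above; the helper subs date_key() and in_range() are hypothetical and not part of IRStats, and treating both end points as inclusive is an assumption.

```perl
use strict;
use warnings;

# The six parameter names are those listed above; the values are illustrative.
my %params = (
    start_day => 1,  start_month => 1,  start_year => 2010,
    end_day   => 31, end_month   => 12, end_year   => 2010,
);

# Hypothetical helper: collapse year/month/day into a sortable YYYYMMDD number.
sub date_key {
    my ($year, $month, $day) = @_;
    return sprintf('%04d%02d%02d', $year, $month, $day) + 0;
}

# Hypothetical helper: returns 1 if the date lies within the range, 0 otherwise.
# Assumption: the start and end dates themselves count as inside the range.
sub in_range {
    my ($p, $year, $month, $day) = @_;
    my $start = date_key(@{$p}{qw(start_year start_month start_day)});
    my $end   = date_key(@{$p}{qw(end_year end_month end_day)});
    my $key   = date_key($year, $month, $day);
    return ($key >= $start && $key <= $end) ? 1 : 0;
}

print in_range(\%params, 2010, 7, 7) ? "counted\n" : "ignored\n";
print in_range(\%params, 2011, 7, 7) ? "counted\n" : "ignored\n";
```

When called through a web browser, the same six names would be supplied as cgi parameters rather than as a hash.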
+
 
+
==== An Eprint Set ====
+
 
+
As well as defining a daterange, we also have to inform IRStats of which publications we are interested in.  Any publication not in the set will be ignored.  A set of eprints can either be a single eprint or any set of eprints the system administrator wishes to define in the config files.
+
 
+
==== View ====
+
 
+
The final parameter tells IRStats how we want to process and display the statistics.  This is done by selecting a View.
+
 
+
=== Views ===
+
 
+
Views are perl modules which plug in to IRStats.  They have been designed to be user configurable, though some knowledge of perl is probably required.  When a query is made to IRStats, a View is created.  It generates some parameters for the DatabaseInterface object, which queries the database and passes back the results of the query.  The View then iterates over the database rows and processes the stats in any way programmatically possible.  These processed results are then passed to a Visualisation.
+
 
+
=== Visualisations ===
+
 
+
A Visualisation takes a set of processed statistics and outputs them.  For example, Visualisation::Graph::Pie creates a pie chart.
+
 
+
=== The Database Interface ===
+
 
+
The Database Interface object handles all queries to the database.  Most requests for statistics can be completed with a single call to the get_stats($params) method.
+
 
+
=== Data Flow Diagram ===
+
[[Image:irstats_overview.png]]
+
 
+
== Required Data ==
+
 
+
In order for IRStats to run, it requires two things:
+
 
+
* a database table containing all hits to the repository
+
* text files describing the contents of the repository
+
 
+
=== The Hits Table ===
+
 
+
Awaiting a redevelopment.
+
 
+
=== The Text Files ===
+
 
+
In order for IRStats to build up a picture of a repository, a number of text files need to be created and stored in the cfg/ directory:
+
 
+
* epstats_set_membership.txt
+
* epstats_set_member_codes.txt
+
* epstats_set_member_full_citations.txt
+
* epstats_set_member_short_citations.txt
+
* epstats_set_member_urls.txt
+
 
+
==== Explanation by Example ====
+
 
+
Imagine a very small repository.  Here are its contents:
+
 
+
* eprints
+
** (1) The Smells of Cheese
+
** (2) The Tastes of Wines
+
** (3) The Sounds of Oboes
+
* Authors
+
** (1) John Smith
+
** (2) Harriet Jones
+
 
+
If we then imagine that the following are also true:
+
 
+
* John Smith is credited with being an author of eprints (1) and (2)
+
* Harriet Jones is credited with being an author of eprints (2) and (3)
+
* All three eprints are the output of a research group named "Senses"
+
 
+
===== Creating sets =====
+
 
+
Sets are groups of eprints, and every eprint is a member of at least one set (the set containing only that eprint).  From the information above, we have three classes of set: the eprint sets, the author sets and the research group set.  We need to add the following to epstats_set_membership.txt (the format is <id><tab><comma-separated list of eprint ids>):
+
 
+
author_1        1,2
+
author_2        2,3
+
group_1        1,2,3
+
eprint_1        1
+
eprint_2        2
+
eprint_3        3
+
 
+
===== Giving Sets IDs =====
+
 
+
So, we now have some sets, but we need to give them unique IDs so that we can retrieve stats for these sets.  To do this, we add the following to epstats_set_member_codes.txt:
+
 
+
author_1        js
+
author_2        hj
+
group_1        senses
+
eprint_1        1
+
eprint_2        2
+
eprint_3        3
+
 
+
IRStats now assigns the following unique IDs to each set: author_js, author_hj, group_senses, eprint_1, eprint_2, eprint_3.  Note that the IDs should probably be kept alphanumeric, and must be unique within a class of sets (but you can have author_hj, group_hj and eprint_hj).
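The relationship between the two files can be sketched in a few lines of Perl. This is only an illustration of the mapping described above, not the code IRStats itself uses: parse_pairs() is a hypothetical helper, and the file contents are inlined as strings rather than read from cfg/.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "<id><tab><value>" lines into a hash reference.
sub parse_pairs {
    my (@lines) = @_;
    my %pairs;
    for my $line (@lines) {
        next unless $line =~ /\S/;              # skip blank lines
        my ($id, $value) = split /\t/, $line, 2;
        $pairs{$id} = $value;
    }
    return \%pairs;
}

# Contents of epstats_set_membership.txt from the example above.
my $membership = parse_pairs(
    "author_1\t1,2", "author_2\t2,3", "group_1\t1,2,3",
    "eprint_1\t1",   "eprint_2\t2",   "eprint_3\t3",
);

# Contents of epstats_set_member_codes.txt from the example above.
my $codes = parse_pairs(
    "author_1\tjs", "author_2\thj", "group_1\tsenses",
    "eprint_1\t1",  "eprint_2\t2",  "eprint_3\t3",
);

# Combine the class prefix with the code, e.g. author_1 with code js gives author_js.
my %sets;
for my $id (sort keys %$membership) {
    my ($class) = $id =~ /^([^_]+)_/;
    my $unique_id = $class . '_' . $codes->{$id};
    $sets{$unique_id} = [ split /,/, $membership->{$id} ];
}

print "$_: @{ $sets{$_} }\n" for sort keys %sets;
```

Keying the result on the unique ID makes it easy to see why codes must be unique within a class: two authors sharing a code would overwrite each other's entry.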
+
 
+
===== Citations =====
+
 
+
IRStats uses two citations for each set member, one short and one long.  Which you use depends on how you would like your visualisation to look.  However, we do need to add these to the citations files:
+
 
+
epstats_set_member_short_citations.txt
+
author_1        Smith
+
 
+
epstats_set_member_full_citations.txt
+
author_1        Dr John Smith, PhD
+
 
+
Note that the above examples are only for author_1.  It would be exactly the same for any set member.
+
 
+
===== URLs =====
+
 
+
Although URLs are not currently implemented, it is probably a good idea to include this information (in epstats_set_member_urls.txt) for future functionality.
+
 
+
author_1        http://homepage.john.smith.com/
+
 
+
== Installing IRStats ==
+
 
+
To run IRStats there are two separate processes that need to be completed:
+
 
+
* Creating the Log Files in the required format
+
* Running IRStats
+
 
+
=== Creating the Log Files ===
+
 
+
To create the log file it is recommended that you have the following installed:
+
 
+
==== Dependencies ====
+
 
+
===== Logfile::EPrints =====
+
 
+
The Logfile::EPrints modules are used to assist in filtering the raw access log.
+
They can be installed from CPAN.
+
 
+
===== AWStats =====
+
 
+
AWStats data is used to filter out webspiders and classify search engines. This is a separate log analysing program and can be obtained from http://awstats.sourceforge.net/
+
 
+
Once AWStats is installed it is necessary to edit irstats.cfg to enter the correct path to the perl modules. The default path is /usr/local/awstats/wwwroot/cgi-bin/lib/search_engines.pm
+
 
+
===== Geo::IP or Geo::IP::PurePerl =====
+
 
+
Geo::IP is used to fill in country and organisation information.  The country database is free, but if you want organisation information, you will have to purchase a subscription for their database.
+
 
+
Geo::IP::PurePerl, the pure-Perl version of Geo::IP, is available from CPAN but does not support organisations.
+
 
+
===== MySQL =====
+
 
+
The information about the log files is stored in a database, so it is necessary to have a MySQL client and server (or equivalent) running.
+
 
+
If you are importing the data from elsewhere rather than generating it yourself, then the command to import a dump file is:
+
 
+
mysql -uroot --database=[database name] < [table name].dump
+
 
+
The minimum tables you need dump files for to create the standard graphs are:
+
* irstats_true_accesses_table
+
* irstats_column_referrer_scope
+
* irstats_column_referring_entity_id
+
* irstats_column_requester_host
+
* irstats_column_requester_organisation
+
* irstats_column_search_engine
+
* irstats_column_search_terms
+
 
+
Information about the database configuration needs to be set in the irstats.cfg file.
+
 
+
As well as the database tables it is necessary to create a user and password which the script can use to access the data and give that user the necessary permissions. The SQL is:
+
 
+
grant all privileges on [database name].* to [user name]@localhost identified by '[user password]';
+
 
+
=== Creating the Graphs ===
+
 
+
Once the log files have been created, IRStats has the following dependencies:
+
 
+
==== Dependencies ====
+
 
+
===== Date::Calc =====
+
 
+
Date::Calc is used to control the periods for which information is returned. The module can be downloaded from CPAN.
+
 
+
=== Installing ===
+
 
+
Once all the required programs and modules have been installed then IRStats can be installed and run.
+
 
+
The IRStats files should be copied (and untarred if necessary) into the /opt/ directory.
+
 
+
If IRStats is put elsewhere then the paths to the relevant files need to be set in the irstats.cfg file. It is worth checking irstats.cfg anyway to confirm that all the paths are correct for your setup.
+
 
+
==== Folder Permissions ====
+
 
+
===== Folders requiring Read and Execute Permissions =====
+
 
+
* /irstats/cfg
+
* /irstats/cgi
+
* /irstats/perl_lib
+
 
+
===== Folders requiring Read and Write Permissions =====
+
 
+
* /irstats/cgi/view_thumbs
+
* /irstats/cache
+
* /irstats/img
+
 
+
==== Running the Perl Script to Populate the Database ====
+
 
+
In irstats.cfg, edit the paths of the files used to store set information so that they are correct. The default place for these files is /opt/irstats/data/, so if the path is set to /opt/irstats/cfg/ it needs to be changed.
+
 
+
Due to a small bug it is necessary to open the irstats/bin/import_metadata.pl script and comment out the following lines before it is run (the lines can be commented out by adding a # at the start of each line):
+
 
+
$database->do_sql("DROP TABLE $table");
+
 
+
$database->do_sql("DROP TABLE $citation_table");
+
 
+
$database->do_sql("DROP TABLE $code_table");
+
 
+
 
+
Having commented out these lines run the perl script. This will populate the database with the necessary author, paper and group tables.
+
 
+
Once the script has run successfully, uncomment the three lines again.
+
 
+
==== Configuring Apache ====
+
 
+
In the apache2 configuration file it is necessary to add the following lines:
+
 
+
Alias /stats/view_thumbs /opt/irstats/cgi/view_thumbs
+
 
+
ScriptAlias /stats/ /opt/irstats/cgi/
+
 
+
Alias /img/ /opt/irstats/img/
+
 
+
 
+
(Don't forget to restart apache after you have made the changes to the config file)
+
 
+
=== Customising ===
+
 
+
It will almost always be necessary to perform some customisation on IRStats because every repository is different.
+
 
+
==== Updating the Table ====
+
 
+
The tables are updated by running the update_tables.pl script which is located in the /data/ folder. This script needs to be run whenever the tables need to be changed. For most systems it is recommended that the script is automatically run at a given interval, for example once a night.
+
 
+
==== Creating New Views ====
+
 
+
 
+
A view has three subs:
+
 
+
* initialise
+
* new
+
* populate
+
 
+
The basic view program looks like this:
+
 
+
<pre>
+
package IRStats::View::<View Name Here>;
+
 
+
use strict;
+
use warnings;
+
 
+
use IRStats::DatabaseInterface;
+
use IRStats::Cache;
+
use IRStats::Visualisation::<Visualisation Module Here>;
+
use IRStats::View;
+
use Data::Dumper;
+
 
+
 
+
our @ISA = qw/ IRStats::View /;
+
 
+
sub initialise
+
{
+
<Initialisation Code Here>
+
}
+
 
+
sub new
+
{
+
<New Code Here>
+
}
+
 
+
sub populate
+
{
+
<Population Code Here>
+
}
+
 
+
1;
+
 
+
</pre>
+
 
+
Considering each of the subs:
+
 
+
=====initialise=====
+
 
+
<pre>
+
my ($self) = @_;
+
</pre>
+
 
+
 
+
Define SQL Parameters:
+
 
+
<pre>
+
$self->{'sql_params'} ={
+
<parameters go here in comma-separated list>
+
};
+
</pre>
+
 
+
The parameters are:
+
 
+
* columns (which columns to return. May include 'COUNT')
+
* where (any conditionals, divided into column, operator and value. Multiple conditionals can be added as a comma-separated list with each set of conditional statements surrounded by curly brackets)
+
* group (what the returned information should be grouped by)
+
* order (the order that the returned information should be in. Divided into column and direction.)
+
* limit (limit on the number of answers)
+
 
+
It is only necessary to include the parameters that you need to set.
+
 
+
For example:
+
<pre>
+
$self->{'sql_params'} = {
+
columns => [ 'eprint', 'COUNT' ],
+
group => "eprint",
+
order => {column => "COUNT", direction => "DESC"},
+
limit => 10
+
};
+
</pre>
+
 
+
Having defined the SQL parameters, it is necessary to set up the graph constructor:
+
<pre>
+
        $self->{'visualisation'} = <Graph Module here>->new(
+
{
+
<Parameters go here>
+
}
+
        );
+
</pre>
+
The Graph Constructors are:
+
 
+
* IRStats::Visualisation::Graph::Bar
+
* IRStats::Visualisation::Graph::Line
+
* IRStats::Visualisation::Graph::Pie
+
* IRStats::Visualisation::Populate: Table::HTML & Table::CSV::CSV
+
* IRStats::Visualisation::Populate: Table::HTML & Table::CSV
+
* IRStats::Visualisation::Table::HTML_Columned
+
* IRStats::Visualisation::HTML
+
 
+
Different constructors have different parameters:
+
 
+
 
+
====== Graphs: ======
+
 
+
'''IRStats::Visualisation::Graph::Bar''' and '''IRStats::Visualisation::Graph::Line'''
+
 
+
filename => $self->{'params'}->get('id') . ".png",
+
title => "<Your Title Here>",
+
x_title => "<Your X-Axis Title Here>",
+
y_title => "<Your Y-Axis Title Here>",
+
data_series => [],
+
x_labels => [],
+
params => $self->{params}
+
 
+
'''IRStats::Visualisation::Graph::Pie'''
+
 
+
filename => $self->{'params'}->get('id') . ".png",
+
title => "<Your Title Here>",
+
data_series => [],
+
params => $self->{params}
+
 
+
IRStats::Visualisation::Populate: '''Table::HTML''' & '''Table::CSV::CSV''' and '''IRStats::Visualisation::Table::HTML''' & '''Table::CSV'''
+
 
+
columns => [<Comma-Separated List of Column Headers Here>],
+
rows => []
+
 
+
IRStats::Visualisation::Populate: '''Table::HTML''' & '''Table::CSV_Columned'''
+
 
+
title => "<Your Title Here>",
+
columns => [<Comma-Separated List of Column Headers Here>],
+
rows => []
+
 
+
'''IRStats::Visualisation::HTML'''
+
 
+
html => '<Any Default HTML Goes Here>'
+
 
+
 
+
 
+
Having created the constructor, you may wish to create a number of global parameters to store information such as the maximum number of rows. In that case, *after* the constructor, add a line of the form:
+
 
+
$self->{<Your Parameter>} = <Your Value Here>;
+
 
+
 
+
So the whole thing should look like:
+
<pre>
+
sub initialise
+
{
+
        my ($self) = @_;
+
$self->{'sql_params'} = {
+
<Your Parameters Here>
+
};
+
        $self->{'visualisation'} = <Your Visualisation Type Here>->new(
+
{
+
<Your Parameters Here>
+
}
+
        );
+
<Any Additional Parameters Here>
+
}
+
</pre>
+
 
+
===== New =====
+
<pre>
+
sub new
+
{
+
        my( $class, $params, $database ) = @_;
+
        my $self = $class->SUPER::new($params, $database);
+
        $self->initialise();
+
        return $self;
+
}
+
</pre>
+
 
+
===== Populate =====
+
 
+
Populate is the complicated section where the main programming takes place.
+
 
+
It almost always starts with the following declarations:
+
 
+
<pre>
+
my ($self) = @_;
+
 
+
##Check Cache
+
my $cache = IRStats::Cache->new($self->{'params'});
+
if ($cache->exists)
+
{
+
$self->{'visualisation'} = $cache->read();
+
return;
+
}
+
</pre>
+
 
+
and ends:
+
 
+
<pre>
+
 
+
$self->{'visualisation'}->set('x_labels', $x_labels);
+
$self->{'visualisation'}->set('data_series', $data_series);
+
 
+
##write to cache
+
$cache->write($self->{'visualisation'});
+
 
+
</pre>
+
 
+
although the setting of the $self->{'visualisation'} depends on which visualisations are needed. The following is a general guide:
+
 
+
'''Graphs (Not Pie)''':
+
 
+
$self->{'visualisation'}->set('x_labels', $x_labels);
+
$self->{'visualisation'}->set('data_series', $data_series);
+
 
+
'''Pie Graphs''':
+
 
+
$self->{'visualisation'}->set('data_series', $data_series);
+
 
+
'''Plain HTML''':
+
 
+
$self->{'visualisation'}->set('html',$html);
+
 
+
'''Tables''' and '''CSV''':
+
 
+
$self->{'visualisation'}->set('rows',$rows);
+
 
+
 
+
The sub should also contain a call to the database to carry out the previously defined query
+
 
+
<pre>
+
      <define variables>
+
 
+
my $query = $self->{'database'}->get_stats(
+
$self->{'params'},
+
$self->{'sql_params'}
+
);
+
 
+
while ( my @row = $query->fetchrow_array() )
+
{
+
<assign the results to the relevant variables>
+
}
+
$query->finish();
+
 
+
</pre>
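To make the row-processing step more concrete, here is a minimal sketch of turning rows into the x_labels and data_series values for a bar or line graph. It assumes the query was built with columns => [ 'eprint', 'COUNT' ] as in the earlier example, so each row is an (eprint, count) pair. The rows are hard-coded stand-ins for what fetchrow_array() would return, the label format is illustrative, and the shape of data_series (one inner array per plotted series) is an assumption, not something this page confirms.

```perl
use strict;
use warnings;

# Stand-in for the rows fetched from the database: (eprint id, access count).
my @rows = ( [ 3, 120 ], [ 1, 75 ], [ 2, 40 ] );

my $x_labels    = [];
my $data_series = [ [] ];   # assumption: one inner array per plotted series

for my $row (@rows) {
    my ($eprint, $count) = @$row;
    push @$x_labels, "eprint_$eprint";        # illustrative label format
    push @{ $data_series->[0] }, $count;
}

print join(', ', @$x_labels), "\n";
print join(', ', @{ $data_series->[0] }), "\n";
```

In a real view, the loop body would sit inside the while loop over $query->fetchrow_array(), and the two variables would then be handed to the visualisation with set().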
+
 
+
As well as the above the populate sub contains the code to analyze, alter and manipulate the data retrieved from the database before publishing it as a graph.
+
 
+
The most basic function resembles the following:
+
 
+
<pre>
+
 
+
my ($self) = @_;
+
##Check Cache
+
my $cache = IRStats::Cache->new($self->{'params'});
+
if ($cache->exists)
+
{
+
$self->{'visualisation'} = $cache->read();
+
return;
+
}
+
 
+
<create variables e.g. my $rows = [];>
+
 
+
my $query = $self->{'database'}->get_stats(
+
$self->{'params'},
+
$self->{'sql_params'}
+
);
+
 
+
while ( my @row = $query->fetchrow_array() )
+
{
+
<process and store data e.g. push @{$rows}, \@row;>
+
}
+
$query->finish();
+
 
+
<send to visualisation e.g. $self->{'visualisation'}->set('rows',$rows);>
+
 
+
##write to cache
+
$cache->write($self->{'visualisation'});
+
 
+
</pre>
+

Revision as of 12:33, 7 July 2011

IRStats is a flexible statistics package which allows easy processing of accesses to fulltext documents of eprints. For more detailed information, please see the IRStats Technical Documentation, though it is now somewhat out of date.

The front end

The configuration file

Documentation to follow.