Difference between revisions of "IRStats"

From EPrints Documentation
m (Populate)
(18 intermediate revisions by 4 users not shown)
Line 1: Line 1:
IRStats is a flexible statistics package which allows easy processing of accesses to fulltext and abstract pages of eprints. For more detailed information, please see the [[IRStats Technical Documentation]].
+
[[Category:IRStats]]
 +
<div style="border: 2px solid red; background-color: yellow;padding:10px">This is IRStats 1 documentation. IRStats 1 is now out of support. You may have been looking for [[IRStats2]]</div>
  
== Technical Overview ==
+
IRStats is a flexible statistics package which allows easy processing of accesses to fulltext documents of eprints. It can be downloaded from the [http://files.eprints.org/722/ Eprints File repository]. For more detailed information, please see the [[IRStats Technical Documentation]], though it is now somewhat out of date.
  
The following is a quick tour of IRStats.
+
== The front end ==
  
=== Parameters ===
+
===The Query Form===
  
IRStats output depends on four parameters, which need to be passed as CGI parameters if called through a web browser, or in a hash if called through the Perl API.  These are:
+
The main interface to IRStats is found at the following URL (given a repository base URL of myrepository.ac.uk):
 
 
==== Start Date and End Date ====
 
 
 
Date parameters are implemented as separate day, month and year parameters, so these two parameters are actually six (start_day, start_month, start_year, end_day, end_month, end_year).  Any statistics outside this date range are ignored.
 
 
 
==== An Eprint Set ====
 
 
 
As well as defining a date range, we also have to tell IRStats which publications we are interested in.  Any publication not in the set will be ignored.  A set of eprints can either be a single eprint or any set of eprints the system administrator wishes to define in the config files.
 
 
 
==== View ====
 
 
 
The final parameter tells IRStats how we want to process and display the statistics.  This is done by selecting a View.
 
 
 
=== Views ===
 
 
 
Views are Perl modules which plug in to IRStats.  They have been designed to be user configurable, though some knowledge of Perl is probably required.  When a query is made to IRStats, a View is created.  It generates some parameters for the DatabaseInterface object, which queries the database and passes back the results of the query.  The View then iterates over the database rows and processes the stats in any way programmatically possible.  These processed results are then passed to a Visualisation.
 
 
 
=== Visualisations ===
 
 
 
A Visualisation takes a set of processed statistics and outputs them.  For example, Visualisation::Graph::Pie creates a pie chart.
 
 
 
=== The Database Interface ===
 
 
 
The Database Interface object handles all queries to the database.  Most requests for statistics can be completed with a single call to the get_stats($params) method.
 
 
 
=== Data Flow Diagram ===
 
[[Image:irstats_overview.png]]
 
 
 
== Required Data ==
 
 
 
In order for IRStats to run, it requires two things:
 
 
 
* a database table containing all hits to the repository
 
* text files describing the contents of the repository
 
 
 
=== The Hits Table ===
 
 
 
Awaiting a redevelopment.
 
 
 
=== The Text Files ===
 
 
 
In order for IRStats to build up a picture of a repository, a number of text files need to be created and stored in the cfg/ directory:
 
 
 
* epstats_set_membership.txt
 
* epstats_set_member_codes.txt
 
* epstats_set_member_full_citations.txt
 
* epstats_set_member_short_citations.txt
 
* epstats_set_member_urls.txt
 
 
 
==== Explanation by Example ====
 
 
 
Imagine a very small repository.  Here are its contents:
 
 
 
* eprints
 
** (1) The Smells of Cheese
 
** (2) The Tastes of Wines
 
** (3) The Sounds of Oboes
 
* Authors
 
** (1) John Smith
 
** (2) Harriet Jones
 
 
 
If we then imagine that the following are also true:
 
 
 
* John Smith is credited with being an author of eprints (1) and (2)
 
* Harriet Jones is credited with being an author of eprints (2) and (3)
 
* All three eprints are the output of a research group named "Senses"
 
 
 
===== Creating sets =====
 
 
 
Sets are groups of eprints, and every eprint is a member of at least one set (the set containing only that eprint).  From the information above, we have three kinds of set: the eprint sets, the author sets and the research group set.  We need to add the following to epstats_set_membership.txt (the format is <id><tab><csv list of eprint ids>):
 
 
 
author_1        1,2
 
author_2        2,3
 
group_1        1,2,3
 
eprint_1        1
 
eprint_2        2
 
eprint_3        3
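The file format above can be read with a few lines of Perl. The following is only a minimal sketch, not the parser IRStats itself uses: the helper name parse_set_membership is hypothetical, and it assumes each line holds an id, a single tab, and a comma-separated list of eprint ids.

```perl
use strict;
use warnings;

# Hypothetical helper: parse lines of the form <id><tab><csv list of eprint ids>
# into a hash ref mapping each set id to an array ref of eprint ids.
sub parse_set_membership
{
	my (@lines) = @_;
	my %members;
	foreach my $line (@lines)
	{
		chomp $line;
		next if $line =~ /^\s*$/;    # skip blank lines
		my ($id, $csv) = split /\t/, $line, 2;
		next unless defined $csv;    # skip lines without a tab
		$members{$id} = [ split /\s*,\s*/, $csv ];
	}
	return \%members;
}

my $sets = parse_set_membership(
	"author_1\t1,2",
	"author_2\t2,3",
	"group_1\t1,2,3",
);
print join(',', @{ $sets->{'group_1'} }), "\n";
```

Reading the real file would only require passing the lines of epstats_set_membership.txt to the helper instead of the literal strings.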
 
 
 
===== Giving Sets IDs =====
 
 
 
So, we now have some sets, but we need to give them unique IDs so that we can retrieve stats for these sets.  To do this, we add the following to epstats_set_member_codes.txt:
 
 
 
author_1        js
 
author_2        hj
 
group_1        senses
 
eprint_1        1
 
eprint_2        2
 
eprint_3        3
 
 
 
IRStats now assigns the following unique IDs to each set: author_js, author_hj, group_senses, eprint_1, eprint_2, eprint_3.  Note that the IDs should probably be kept alphanumeric, and must be unique within a class of sets (but you can have author_hj, group_hj and eprint_hj).
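The mapping from the entries in the two files to the unique IDs listed above can be illustrated as follows. This is a sketch of the naming rule as it appears from the example, not code taken from IRStats; the helper name unique_set_id is hypothetical, and it assumes the set class is everything before the final underscore of the local id.

```perl
use strict;
use warnings;

# Hypothetical helper: combine the set class (the part of the local id before
# the final underscore) with the member code to form the unique set ID.
sub unique_set_id
{
	my ($local_id, $code) = @_;
	my ($class) = $local_id =~ /^(.+)_[^_]+$/;
	return defined $class ? "${class}_${code}" : undef;
}

# Entries as they would appear in epstats_set_member_codes.txt.
my %codes = ( author_1 => 'js', author_2 => 'hj', group_1 => 'senses', eprint_1 => '1' );
foreach my $local_id (sort keys %codes)
{
	print "$local_id -> ", unique_set_id($local_id, $codes{$local_id}), "\n";
}
```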
 
 
 
===== Citations =====
 
 
 
IRStats uses two citations for each set member, one short and one long.  Which you use depends on how you would like your visualisation to look.  However, we do need to add these to the citations files:
 
 
 
epstats_set_member_short_citations.txt
 
author_1        Smith
 
 
 
epstats_set_member_full_citations.txt
 
author_1        Dr John Smith, PhD
 
 
 
Note that the above examples are only for author_1.  It would be exactly the same for any set member.
 
 
 
===== URLs =====
 
 
 
Although URLs are not currently implemented, it is probably a good idea to include this information (in epstats_set_member_urls.txt) for future functionality.
 
 
 
author_1        http://homepage.john.smith.com/
 
 
 
== Installing IRStats ==
 
 
 
To run IRStats there are two separate processes that need to be completed:
 
 
 
* Creating the Log Files in the required format
 
* Running IRStats
 
 
 
=== Creating the Log Files ===
 
 
 
To create the log file it is recommended that you have the following installed:
 
 
 
==== Dependencies ====
 
 
 
===== Logfile::EPrints =====
 
 
 
The Logfile::EPrints modules are used to assist in filtering the raw access log. 
 
They can be installed from CPAN.
 
 
 
===== AWStats =====
 
 
 
AWStats data is used to filter out webspiders and classify search engines. This is a separate log analysing program and can be obtained from http://awstats.sourceforge.net/
 
 
 
Once AWStats is installed it is necessary to edit irstats.cfg to enter the correct path to the perl modules. The default path is /usr/local/awstats/wwwroot/cgi-bin/lib/search_engines.pm
 
 
 
===== Geo::IP or Geo::IP::PurePerl =====
 
 
 
Geo::IP is used to fill in country and organisation information.  The country database is free, but if you want organisation information, you will have to purchase a subscription for their database.
 
 
 
Geo::IP::PurePerl, the pure-Perl version of Geo::IP, is available from CPAN but does not support organisations.
 
 
 
===== MySQL =====
 
 
 
The information about the log files is stored in a database file so it is necessary to have a MySQL client and server running (or equivalent).
 
 
 
If you are importing the data from elsewhere rather than generating it yourself, the command to import a dump file is:
 
 
 
mysql -uroot --database=[database name] < [table name].dump
 
 
 
To create the standard graphs you need dump files for at least the following tables:
 
* irstats_true_accesses_table
 
* irstats_column_referrer_scope
 
* irstats_column_referring_entity_id
 
* irstats_column_requester_host
 
* irstats_column_requester_organisation
 
* irstats_column_search_engine
 
* irstats_column_search_terms
 
 
 
Information about the database configuration needs to be set in the irstats.cfg file.
 
 
 
As well as the database tables it is necessary to create a user and password which the script can use to access the data and give that user the necessary permissions. The SQL is:
 
 
 
grant all privileges on [database name].* to [user name]@localhost identified by '[user password]';
 
 
 
=== Creating the Graphs ===
 
 
 
Once the log files are created, IRStats has the following dependencies:
 
 
 
==== Dependencies ====
 
 
 
===== Date::Calc =====
 
 
 
Date::Calc is used to control the periods that information is returned for. The module can be downloaded from CPAN.
 
 
 
=== Installing ===
 
 
 
Once all the required programs and modules have been installed then IRStats can be installed and run.
 
 
 
The IRStats files should be copied, untarred if necessary, into the /opt/ directory.
 
 
 
If IRStats is put elsewhere then the paths to the relevant files need to be set in the irstats.cfg file. It is worth checking irstats.cfg anyway to confirm that all the paths are correct for your setup.
 
 
 
==== Folder Permissions ====
 
 
 
===== Folders requiring Read and Execute Permissions =====
 
 
 
* /irstats/cfg
 
* /irstats/cgi
 
* /irstats/perl_lib
 
 
 
===== Folders requiring Read and Write Permissions =====
 
 
 
* /irstats/cgi/view_thumbs
 
* /irstats/cache
 
* /irstats/img
 
 
 
==== Running the Perl Script to Populate the Database ====
 
 
 
In irstats.cfg, edit the paths of the files used to store set information so they are correct. The default place for these files is /opt/irstats/data/, so if the path is set to /opt/irstats/cfg/ it needs to be changed.
 
 
 
Due to a small bug it is necessary to open the irstats/bin/import_metadata.pl script and comment out the following lines before it is run (the lines can be commented out by adding a # at the start of each line):
 
 
 
$database->do_sql("DROP TABLE $table");
 
 
 
$database->do_sql("DROP TABLE $citation_table");
 
 
 
$database->do_sql("DROP TABLE $code_table");
 
 
 
 
 
Having commented out these lines, run the Perl script. This will populate the database with the necessary author, paper and group tables.
 
 
 
Once the script has been run successfully, uncomment the three lines again.
 
 
 
==== Configuring Apache ====
 
 
 
In the apache2 configuration file it is necessary to add the following lines:
 
 
 
Alias /stats/view_thumbs /opt/irstats/cgi/view_thumbs
 
 
 
ScriptAlias /stats/ /opt/irstats/cgi/
 
 
 
Alias /img/ /opt/irstats/img/
 
 
 
 
 
(Don't forget to restart apache after you have made the changes to the config file)
 
 
 
=== Customising ===
 
 
 
It will almost always be necessary to perform some customisation on IRStats because every repository is different.
 
 
 
==== Updating the Table ====
 
 
 
The tables are updated by running the update_tables.pl script which is located in the /data/ folder. This script needs to be run whenever the tables need to be changed. For most systems it is recommended that the script is automatically run at a given interval, for example once a night.
 
 
 
==== Creating New Views ====
 
 
 
 
 
A view has three subs:
 
 
 
* initialise
 
* new
 
* populate
 
 
 
The basic view program looks like this:
 
 
 
<pre>
 
package IRStats::View::<View Name Here>;
 
 
 
use strict;
 
use warnings;
 
 
 
use IRStats::DatabaseInterface;
 
use IRStats::Cache;
 
use IRStats::Visualisation::<Visualisation Module Here>;
 
use IRStats::View;
 
use Data::Dumper;
 
 
 
 
 
our @ISA = qw/ IRStats::View /;
 
 
 
sub initialise
 
{
 
<Initialisation Code Here>
 
}
 
 
 
sub new
 
{
 
<New Code Here>
 
}
 
 
 
sub populate
 
{
 
<Population Code Here>
 
}
 
 
 
1;
 
 
 
</pre>
 
 
 
Considering each of the subs:
 
 
 
=====initialise=====
 
 
 
<pre>
 
my ($self) = @_;
 
</pre>
 
 
 
 
 
Define SQL Parameters:
 
 
 
<pre>
 
$self->{'sql_params'} ={
 
<parameters go here in comma-separated list>
 
};
 
</pre>
 
 
 
The parameters are:
 
 
 
* columns (which columns to return. May include 'COUNT')
 
* where (any conditionals, divided into column, operator and value. Multiple conditionals can be added as a comma-separated list with each set of conditional statements surrounded by curly brackets)
 
* group (what the returned information should be grouped by)
 
* order (the order that the returned information should be in. Divided into column and direction.)
 
* limit (limit on the number of answers)
 
 
 
You only need to include the parameters that you want to set.
 
 
 
For example:
 
<pre>
 
$self->{'sql_params'} = {
 
columns => [ 'eprint', 'COUNT' ],
 
group => "eprint",
 
order => {column => "COUNT", direction => "DESC"},
 
limit => 10
 
};
 
</pre>
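The example above does not use the where parameter. The following sketch shows how two conditionals might be expressed, based only on the description in the parameter list. The key names column, operator and value, the use of an array reference to hold the conditionals, and the column and value in the first conditional are assumptions for illustration and should be checked against an existing View.

```perl
use strict;
use warnings;

my $sql_params = {
	columns => [ 'eprint', 'COUNT' ],
	# Assumption: each conditional is its own hash with column, operator
	# and value keys. 'referrer_scope' and 'external' are illustrative only.
	where => [
		{ column => 'referrer_scope', operator => '=', value => 'external' },
		{ column => 'eprint',         operator => '>', value => 100 },
	],
	group => 'eprint',
	order => { column => 'COUNT', direction => 'DESC' },
	limit => 10,
};

printf "%d conditionals, limit %d\n",
	scalar @{ $sql_params->{'where'} }, $sql_params->{'limit'};
```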
 
 
 
Having defined the SQL parameters, it is necessary to set up the graph constructor:
 
<pre>
 
        $self->{'visualisation'} = <Graph Module here>->new(
 
{
 
<Parameters go here>
 
}
 
        );
 
</pre>
 
The Graph Constructors are:
 
 
 
* IRStats::Visualisation::Graph::Bar
 
* IRStats::Visualisation::Graph::Line
 
* IRStats::Visualisation::Graph::Pie
 
* IRStats::Visualisation::Populate: Table::HTML & Table::CSV::CSV
 
* IRStats::Visualisation::Populate: Table::HTML & Table::CSV
 
* IRStats::Visualisation::Table::HTML_Columned
 
* IRStats::Visualisation::HTML
 
 
 
Different constructors have different parameters:
 
 
 
 
 
====== Graphs: ======
 
 
 
'''IRStats::Visualisation::Graph::Bar''' and '''IRStats::Visualisation::Graph::Line'''
 
 
 
filename => $self->{'params'}->get('id') . ".png",
 
title => "<Your Title Here>",
 
x_title => "<Your X-Axis Title Here>",
 
y_title => "<Your Y-Axis Title Here>",
 
data_series => [],
 
x_labels => [],
 
params => $self->{params}
 
 
 
'''IRStats::Visualisation::Graph::Pie'''
 
 
 
filename => $self->{'params'}->get('id') . ".png",
 
title => "<Your Title Here>",
 
data_series => [],
 
params => $self->{params}
 
 
 
IRStats::Visualisation::Populate: '''Table::HTML''' & '''Table::CSV::CSV''' and '''IRStats::Visualisation::Table::HTML''' & '''Table::CSV'''
 
 
 
columns => [<Comma-Separated List of Column Headers Here>],
 
rows => []
 
 
 
IRStats::Visualisation::Populate: '''Table::HTML''' & '''Table::CSV_Columned'''
 
 
 
title => "<Your Title Here>",
 
columns => [<Comma-Separated List of Column Headers Here>],
 
rows => []
 
 
 
'''IRStats::Visualisation::HTML'''
 
 
 
html => '<Any Default HTML Goes Here>'
 
 
 
 
 
 
 
Having created the constructor, you may wish to create a number of global parameters to store information such as the maximum number of rows. In that case, add the following line ''after'' the constructor:
 
 
 
$self->{<Your Parameter>} = <Your Value Here>;
 
 
 
 
 
So the whole thing should look like:
 
<pre>
 
sub initialise
 
{
 
        my ($self) = @_;
 
$self->{'sql_params'} = {
 
<Your Parameters Here>
 
};
 
        $self->{'visualisation'} = <Your Visualisation Type Here>->new(
 
{
 
<Your Parameters Here>
 
}
 
        );
 
<Any Additional Parameters Here>
 
}
 
</pre>
 
 
 
===== New =====
 
<pre>
 
sub new
 
{
 
        my( $class, $params, $database ) = @_;
 
        my $self = $class->SUPER::new($params, $database);
 
        $self->initialise();
 
        return $self;
 
}
 
</pre>
 
 
 
===== Populate =====
 
 
 
Populate is the complicated section where the main programming takes place.
 
 
 
It almost always starts with the following declarations:
 
 
 
<pre>
 
my ($self) = @_;
 
 
 
##Check Cache
 
my $cache = IRStats::Cache->new($self->{'params'});
 
if ($cache->exists)
 
{
 
$self->{'visualisation'} = $cache->read();
 
return;
 
}
 
</pre>
 
 
 
and ends:
 
  
 
<pre>
 
<pre>
 
+
myrepository.ac.uk/cgi/irstats.cgi
$self->{'visualisation'}->set('x_labels', $x_labels);
 
$self->{'visualisation'}->set('data_series', $data_series);
 
 
 
##write to cache
 
$cache->write($self->{'visualisation'});
 
 
 
 
</pre>
 
</pre>
  
although the setting of the $self->{'visualisation'} depends on which visualisations are needed. The following is a general guide:
+
You will be presented with a form allowing you to select the parameters with which to generate a report.
 
 
'''Graphs (Not Pie)''':
 
 
 
$self->{'visualisation'}->set('x_labels', $x_labels);
 
$self->{'visualisation'}->set('data_series', $data_series);
 
  
'''Pie Graphs''':
+
===Advanced Report Generation (get_view2 params)===
  
$self->{'visualisation'}->set('data_series', $data_series);
+
The following will help if you wish to create queries by setting the CGI parameters by hand.
  
'''Plain HTML''':
+
There are three fundamental parameters that IRStats uses.  These are:
 +
* A Date Range (actually 6 parameters for day, month and year for both start and end dates)
 +
* A Set of EPrints
 +
* A View
  
  $self->{'visualisation'}->set('html',$html);
+
However, in order to add functionality, the get_view2 page will convert a larger number of parameters into these three. The following table shows all parameters and values, with square brackets denoting variables.
  
'''Tables''' and '''CSV''':
 
  
  $self->{'visualisation'}->set('rows',$rows);
+
{| border="1"
 +
! Parameter
 +
! Possible Values
 +
! Notes
 +
|-
 +
| IRS_datechoice || period, range || Controls whether the 6 date range parameters or the single period parameter is used.
 +
|-
 +
| period || -[X]m, Q[Z][YYYY] || Used when IRS_datechoice=period.<br/>Where m and Q are literal characters, X is a positive integer, Z is an integer in the range 1 to 4 and YYYY is a four digit year.<br/>Examples: <dl><dt>-4m<dd>Go back exactly four months from today's date<dt>Q32004<dd>Quarter 3, 2004</dl>
 +
|-
 +
| start_day, start_month, start_year, end_day, end_month, end_year || integers (1-31, 1-12, four digit respectively) || Used when IRS_datechoice=range.<br/> Note that if a day value is higher than the highest day in the chosen month, it will be treated as the highest day -- e.g. start_day=31&start_month=02 is seen as valid and equivalent to February 28th. Note that start_day=99 is also valid!
 +
|-
 +
| IRS_epchoice || All, EPrint, [set_id] || Controls whether stats will be generated on all eprints, a single eprint, or a set of eprints.  The 'All' option is the only one that does not require extra parameters.  Note that 'set_id' is the id of a valid set as defined in the IRStats configuration.
 +
|-
 +
| eprint || [eprintid] || Used when IRS_epchoice=EPrint.<br/>Any valid eprint ID (integer).
 +
|-
 +
| [set_id]s || [set_id]_[set_member_code] || Used when IRS_epchoice=[set_id].<br/>  Best described through example: <dl><dt>IRS_epchoice=divisions&divisionss=divisions_art<dd>Will generate a report on the art department, given a standard EPrints repository and IRStats config, where the subject id 'art' exists in the divisions tree in EPrints.</dl>
 +
|-
 +
| view || [view classname] || The classname of the IRStats::View Perl module.
 +
|}
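As an illustration of setting these parameters by hand, the sketch below assembles a query string from name/value pairs taken from the table. It is not part of IRStats: the helper names escape and report_url are hypothetical, the view classname MyView is a placeholder, and whether irstats.cgi needs any further parameter to reach the get_view2 page is not stated here.

```perl
use strict;
use warnings;

# Percent-encode everything outside the unreserved character set.
sub escape
{
	my ($value) = @_;
	$value =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
	return $value;
}

# Build a report URL from a base address and an ordered list of name/value pairs.
sub report_url
{
	my ($base, @pairs) = @_;
	my @parts;
	while (my ($name, $value) = splice(@pairs, 0, 2))
	{
		push @parts, escape($name) . '=' . escape($value);
	}
	return $base . '?' . join('&', @parts);
}

print report_url(
	'myrepository.ac.uk/cgi/irstats.cgi',
	IRS_datechoice => 'period',
	period         => '-4m',
	IRS_epchoice   => 'divisions',
	divisionss     => 'divisions_art',
	view           => 'MyView',    # hypothetical view classname
), "\n";
```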
  
 +
===The Dashboard Form===
  
The sub should also contain a call to the database to carry out the previously defined query
+
A dashboard is a collection of reports on a single item or set of items (e.g. all items by John Smith).  To access the form to generate a report, go to the URL:
  
 
<pre>
 
<pre>
      <define variables>
+
myrepository.ac.uk/cgi/irstats.cgi?page=db
 
 
my $query = $self->{'database'}->get_stats(
 
$self->{'params'},
 
$self->{'sql_params'}
 
);
 
 
 
while ( my @row = $query->fetchrow_array() )
 
{
 
<assign the results to the relevant variables>
 
}
 
$query->finish();
 
 
 
 
</pre>
 
</pre>
  
As well as the above the populate sub contains the code to analyze, alter and manipulate the data retrieved from the database before publishing it as a graph.
+
== The configuration file ==
 
 
The most basic function resembles the following:
 
 
 
<pre>
 
 
 
my ($self) = @_;
 
##Check Cache
 
my $cache = IRStats::Cache->new($self->{'params'});
 
if ($cache->exists)
 
{
 
$self->{'visualisation'} = $cache->read();
 
return;
 
}
 
  
<create variables e.g. my $rows = [];>
+
Documentation to follow.
 
 
my $query = $self->{'database'}->get_stats(
 
$self->{'params'},
 
$self->{'sql_params'}
 
);
 
 
 
while ( my @row = $query->fetchrow_array() )
 
{
 
<process and store data e.g. push @{$rows}, \@row;>
 
}
 
$query->finish();
 
 
 
<send to visualisation e.g. $self->{'visualisation'}->set('rows',$rows);>
 
 
 
##write to cache
 
$cache->write($self->{'visualisation'});
 
 
 
</pre>
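The row-processing step in the middle of populate can be tried out without a database. The sketch below replaces the query with invented rows and shows one way of turning them into x_labels and data_series for a bar or line graph. The row values are made up, and the assumption that data_series holds one array reference per plotted series should be checked against an existing View.

```perl
use strict;
use warnings;

# Mock rows standing in for what fetchrow_array() would return for
# columns => [ 'eprint', 'COUNT' ]; the values are invented.
my @mock_rows = ( [ 3, 120 ], [ 1, 85 ], [ 2, 40 ] );

my $x_labels = [];
my $series   = [];
foreach my $row (@mock_rows)
{
	my ($eprint, $count) = @{$row};
	push @{$x_labels}, "eprint $eprint";
	push @{$series}, $count;
}

# Assumption: data_series holds one array ref per plotted series.
my $data_series = [ $series ];

print join(', ', @{$x_labels}), "\n";
print join(', ', @{ $data_series->[0] }), "\n";
```

In a real View the two structures would then be handed to the visualisation with the set calls shown earlier.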
 

Revision as of 10:14, 13 December 2017

This is IRStats 1 documentation. IRStats 1 is now out of support. You may have been looking for IRStats2

IRStats is a flexible statistics package which allows easy processing of accesses to fulltext documents of eprints. It can be downloaded from the Eprints File repository. For more detailed information, please see the IRStats Technical Documentation, though it is now somewhat out of date.

The front end

The Query Form

The main interface to IRStats is found at the following URL (given a repository base URL of myrepository.ac.uk):

myrepository.ac.uk/cgi/irstats.cgi

You will be presented with a form allowing you to select the parameters with which to generate a report.

Advanced Report Generation (get_view2 params)

The following will help if you wish to create queries by setting the CGI parameters by hand.

There are three fundamental parameters that IRStats uses. These are:

  • A Date Range (actually 6 parameters for day, month and year for both start and end dates)
  • A Set of EPrints
  • A View

However, in order to add functionality, the get_view2 page will convert a larger number of parameters into these three. The following table shows all parameters and values, with square brackets denoting variables.


{| border="1"
! Parameter
! Possible Values
! Notes
|-
| IRS_datechoice || period, range || Controls whether the 6 date range parameters or the single period parameter is used.
|-
| period || -[X]m, Q[Z][YYYY] || Used when IRS_datechoice=period.<br/>Where m and Q are literal characters, X is a positive integer, Z is an integer in the range 1 to 4 and YYYY is a four digit year.<br/>Examples: <dl><dt>-4m<dd>Go back exactly four months from today's date<dt>Q32004<dd>Quarter 3, 2004</dl>
|-
| start_day, start_month, start_year, end_day, end_month, end_year || integers (1-31, 1-12, four digit respectively) || Used when IRS_datechoice=range.<br/> Note that if a day value is higher than the highest day in the chosen month, it will be treated as the highest day -- e.g. start_day=31&start_month=02 is seen as valid and equivalent to February 28th. Note that start_day=99 is also valid!
|-
| IRS_epchoice || All, EPrint, [set_id] || Controls whether stats will be generated on all eprints, a single eprint, or a set of eprints. The 'All' option is the only one that does not require extra parameters. Note that 'set_id' is the id of a valid set as defined in the IRStats configuration.
|-
| eprint || [eprintid] || Used when IRS_epchoice=EPrint.<br/>Any valid eprint ID (integer).
|-
| [set_id]s || [set_id]_[set_member_code] || Used when IRS_epchoice=[set_id].<br/>Best described through example: <dl><dt>IRS_epchoice=divisions&divisionss=divisions_art<dd>Will generate a report on the art department, given a standard EPrints repository and IRStats config, where the subject id 'art' exists in the divisions tree in EPrints.</dl>
|-
| view || [view classname] || The classname of the IRStats::View Perl module.
|}
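The two forms of the period parameter can be told apart with a pair of regular expressions. The following is a sketch only, not the code IRStats uses: the helper name parse_period and the keys of the returned hash are hypothetical, and it assumes calendar quarters, so that quarter 3 covers months 7 to 9.

```perl
use strict;
use warnings;

# Classify a period value as described in the table:
# "-[X]m" (X months back) or "Q[Z][YYYY]" (quarter Z of year YYYY).
# Returns a hash ref, or nothing if the value matches neither form.
sub parse_period
{
	my ($period) = @_;
	if ($period =~ /^-([1-9][0-9]*)m$/)
	{
		return { type => 'months_back', months => $1 };
	}
	if ($period =~ /^Q([1-4])([0-9]{4})$/)
	{
		# Assumption: calendar quarters, so quarter Z covers months 3Z-2 to 3Z.
		return {
			type        => 'quarter',
			quarter     => $1,
			year        => $2,
			first_month => 3 * $1 - 2,
			last_month  => 3 * $1,
		};
	}
	return;
}

my $p = parse_period('Q32004');
print "months $p->{first_month}-$p->{last_month} of $p->{year}\n";
```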

The Dashboard Form

A dashboard is a collection of reports on a single item or set of items (e.g. all items by John Smith). To access the form to generate a report, go to the URL:

myrepository.ac.uk/cgi/irstats.cgi?page=db

The configuration file

Documentation to follow.