Difference between revisions of "IRStats Technical Documentation"

From EPrints Documentation
m (category and redirection updated)
 
(16 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This document is intended as guidance to the last stage of development of EPstats.
+
[[Category:Obsolete]]
 
+
<div style="border: 2px solid red; background-color: yellow;padding:10px">This is IRStats 1 documentation. IRStats 1 is now out of support. You may have been looking for [[IRStats2]]</div>
 
= Directory Structure =
 
= Directory Structure =
  
== /opt/epstats ==
+
== /opt/irstats/bin ==
Contains data files for GeoIP.  If I had had root access, I would have put them in the correct place.  They are linked to from the correct place.  These need regular updating, something which hasn't been implemented.
 
 
 
== /opt/epstats/bin ==
 
 
Contains the scripts needed to update the table.
 
Contains the scripts needed to update the table.
  
 
*daily_update.sh - Runs all the scripts in the right order.
 
*daily_update.sh - Runs all the scripts in the right order.
 
*extract_metadata_from_archive.pl - Extracts eprint, author and group metadata from the repository by iterating over every eprint.
 
*extract_metadata_from_archive.pl - Extracts eprint, author and group metadata from the repository by iterating over every eprint.
*update_table.pl - Filters and processes new entries in the accesslog to update the epstats_true_acesses_table.  Uses 'SearchParser.pm' and 'repeatscache'.
+
*update_table.pl - Filters and processes new entries in the accesslog to update the irstats_true_accesses_table.  Uses 'SearchParser.pm' and 'repeatscache'.
* convert_ip_to_host.pl - Attempts to convert ip addresses of the new entries in epstats_true_acesses_table to hostnames.  Uses 'host_updated' to keep track of where it got to last time.
+
* convert_ip_to_host.pl - Attempts to convert the IP addresses of new entries in irstats_true_accesses_table to hostnames.  Uses 'host_updated' to keep track of where it got to last time.
  
 
Note that most of these scripts probably need to be tidied up.  They were written in a hurry and were never polished.
 
Note that most of these scripts probably need to be tidied up.  They were written in a hurry and were never polished.
  
== /opt/epstats/cache ==
+
== /opt/irstats/cache ==
Contains cache files.  Feel free to delete these whenever you like.
+
Contains cache files.  These should probably be deleted whenever the database is updated.
  
== /opt/epstats/cgi ==
+
== /opt/irstats/cgi ==
  
Contains two scripts, 'get_view' and 'stats'.
+
Contains two scripts, 'get_view' and 'stats'.
  
*get_view returns the output of a EPstats::View (see below), which is currently a chunk of html or csv, but could be almost anything.
+
*get_view returns the output of an IRStats::View (see below), which is currently a chunk of html or csv, but could be almost anything.
 
*stats is a handy cgi form that passes arguments to get_view
  
== /opt/epstats/img ==
+
== /opt/irstats/img ==
  
 
Conceptually, where any images would be kept (e.g. national flags).  At the moment, only the img/graphs directory is used.  This is where generated graphs are stored.
 
Conceptually, where any images would be kept (e.g. national flags).  At the moment, only the img/graphs directory is used.  This is where generated graphs are stored.
  
== /opt/epstats/perl_lib ==
+
== /opt/irstats/cfg ==
 +
 
 +
Where the configuration file and the text files containing repository data are held.
 +
 
 +
=== The Configuration File ===
 +
 
 +
irstats.cfg contains a number of configuration strings.  Here are some of the more important ones, with the default in brackets:
 +
 
 +
*configuration_path (/opt/irstats/cfg/) - The path of the configuration directory.
 +
*view_path (/opt/irstats/perl_lib/IRStats/View/) - The directory containing the Views.
 +
*cache_path (/opt/irstats/cache/) - The directory in which to store cache files.
 +
*graph_path (/opt/irstats/img/graphs/) - The directory in which to store graph images.
 +
*graph_relative_url_path (/img/graphs/) - The url of the directory in which the graph file is from the point of view of the web browser.
 +
*update_lock_filename (/opt/irstats/bin/.lock) - The name of the file that is created to prevent the update process running twice concurrently
 +
*The names of the files used to store set information
 +
**set_member_full_citations_file (/opt/irstats/cfg/irstats_set_member_full_citations.txt)
 +
**set_member_short_citations_file (/opt/irstats/cfg/irstats_set_member_short_citations.txt)
 +
**set_membership_file (/opt/irstats/cfg/irstats_set_membership.txt)
 +
**set_member_codes_file (/opt/irstats/cfg/irstats_set_member_codes.txt)
 +
**set_member_urls_file (/opt/irstats/cfg/irstats_set_member_urls.txt)
 +
*Referrer Scope Labels (note, if you change these, you should also change them in the database)
 +
**referrer_scope_1 (Internal)
 +
**referrer_scope_2 (ECS)
 +
**referrer_scope_3 (Search)
 +
**referrer_scope_4 (External)
 +
**referrer_scope_no_referrer (None)
 +
*awstats_search_engines (/usr/local/awstats/wwwroot/cgi-bin/lib/search_engines.pm) - The path to the awstats search engine module
 +
*repeats_filter_file (/opt/irstats/bin/repeatscache) - The file to maintain state between updates
 +
*repeats_filter_timeout (86400) - repeat timeout in seconds (the amount of time there needs to be between two hits for them both to be recorded, initially set to 60*60*24)
 +
 
 +
*repository_url = http://eprints.ecs.soton.ac.uk - the path to the repository
 +
 
 +
*database configuration
 +
**database_driver (mysql)
 +
**database_server (localhost)
 +
**database_name
 +
**database_user
 +
**database_password
 +
 
 +
*database_id_columns ([ requester_organisation, requester_host, referrer_scope, search_engine, search_terms, referring_entity_id ]) - The columns in the database that hold a UID rather than data.  These need separate tables in which to store the data.
 +
 
 +
*Various table names and parts of names
 +
**database_eprints_access_log_table (accesslog) ##Perhaps remove after update rewrite.
 +
**database_main_stats_table (irstats_true_accesses_table)
 +
**database_column_table_prefix (irstats_column_)
 +
**database_set_table_prefix (irstats_set_)
 +
**database_set_table_code_suffix (_code)
 +
**database_set_table_citation_suffix (_citation)
 +
 
 +
*id_parameters ([ start_date, end_date, eprints, view ]) - the parameters that are used to uniquely identify a view
 +
*host_lookup_temp_dir (/opt/irstats/bin/convert_hosts_temp_files/) - The directory in which to store temp files for host lookups
  
Contains all the epstats classes.
 
  
= EPStats Classes =
+
== /opt/irstats/perl_lib ==
  
Note that the leading EPStats:: has been left out for brevity.
+
Contains all the irstats classes.
 +
 
 +
= IRStats Classes =
 +
 
 +
Note that the leading IRStats:: has been left out for brevity.
 +
 
 +
== Configuration ==
 +
This object acts as an interface to the configuration file.
 +
=== Configuration Constants ===
 +
*$configuration_file - The path to the configuration file.
 +
=== Functions ===
 +
*new - Parses the configuration file and returns a new object.
 +
*get_value(config_id) - Returns a value.
  
 
== Params ==
 
== Params ==
This object holds the parameters that are used to generate the statistics.  The most imortant of these are a date range and an eprint set.
+
This object holds the parameters that are used to generate the statistics.  This is passed around the system.
 
=== Configuration Constants ===
 
=== Configuration Constants ===
*$cgi_script - the name of the cgi script (currently unused)
 
*$id_params - When generating an ID, which parameters are important.
 
 
*$defaults - Any default parameters you wish to set.
 
*$defaults - Any default parameters you wish to set.
  
 
=== Functions ===
 
=== Functions ===
*new(CGI_object) - returns new object
+
*new(Configuration, [ CGI_object | params_hash ]) - returns new object
 
*mask(params_hash) - used when you want to temporarily overwrite parameter(s).  Overwrites values with contents of params_hash.  Overwritten values get pushed onto a stack.
 
*mask(params_hash) - used when you want to temporarily overwrite parameter(s).  Overwrites values with contents of params_hash.  Overwritten values get pushed onto a stack.
 
*unmask - Sets parameters back to how they were before the last mask.
 
*unmask - Sets parameters back to how they were before the last mask.
*generate_cgi - returns a string containing the name of the cgi script, and all parameters, to enable the creation of links. (currently unused)
 
 
*get(param_name) - returns the value of a single parameter.
 
*get(param_name) - returns the value of a single parameter.
 
*create_id - Uses MD5 to create a unique ID from the id_params (see Constants above).  This is called whenever get('id') is called.
 
*create_id - Uses MD5 to create a unique ID from the id_params (see Constants above).  This is called whenever get('id') is called.
Line 55: Line 110:
 
== DatabaseInterface ==
 
== DatabaseInterface ==
 
This object does what it says on the tin.  Any access to the database is done through it.
 
<strong>IMPORTANT</strong> - the mysql generated has been developed on a machine running mysql 5.  Installing on the EPrints server has broken this (as it's running mysql 4).  I placed a quick and dirty hack into the do_sql function, and modified the create_top_table function.  I have no idea if this works well.  <strong>IT NEEDS TO BE CHECKED</strong>.
 
 
=== Configuration Constants ===
 
Constants are contained in the new function.
 
 
*DBI Configuration Constants - $driver, $server, $database, $user, $password are all used to create the connection to the database.
 
*source_table - The table in which the stats are stored.
 
  
 
=== Functions ===
 
=== Functions ===
*new() - returns object.
+
*new(Configuration) - returns object.
*retreive_set_names() - returns a list of eprint sets.  Currently 'group' and 'author' are implemented.  This is used to verify cgi input.
+
*retreive_set_names() - returns a list of eprint sets.  This can be used to verify cgi input.
 
*get_membership(eprint_id, set_name) - For a given eprint ID, which of a named set does it belong to.  For example, we can find out which authors eprint 12614 has by get_membership(12614, 'author').
 
*get_membership(eprint_id, set_name) - For a given eprint ID, which of a named set does it belong to.  For example, we can find out which authors eprint 12614 has by get_membership(12614, 'author').
 
*get_citation(id, set, length) - returns a citation.  Every set member (eprint, author, group) has two citations.  short and full.  We only return a short citation if length == 'short'.  So, to get the short citation of a group 3: get_citation(3,'group','short').
 
*get_citation(id, set, length) - returns a citation.  Every set member (eprint, author, group) has two citations.  short and full.  We only return a short citation if length == 'short'.  So, to get the short citation of a group 3: get_citation(3,'group','short').
*get_code($id,$set) - UNWRITTEN - Set member have codes.  This how they are identified by the user.  For example author_lac is the member of the author set whose code is lac.  To get the code for group 3: get_code(3,'group').
+
*get_stats(params_object, query_params_hash) - returns a dbi object containing the stats we are interested in, i.e. those matching the params_object's date range and eprint sets, restricted to the columns named in the query params hash.  The query params hash can contain the following key/value pairs:
 
+
**columns => column_name_array - Which columns are we interested in?
When retreiving statistics, EPStats filters by inner joining the epstats_true_accesses_table to other tables contining eprint IDs.  Sometimes it has to create these tables.
+
**order => column_name - A hash containing a column name and directions (ASC or DESC)
 
 
*create_top_table(param_object) - This creates a table containing the eprint IDs of the top X by fulltext download between two dates.
 
*create_list_table(table_name, eprint_ids) - Takes two strings, one the name of the table, the other a space seperated list of eprint IDs.  Creates a temporary table.
 
 
 
The following are the only two functions that actually make calls to the database.
 
 
 
*do_sql(sql_query_string) - takes a string and performs a query, returning the dbi object containing the results.
 
*insert_values(table_name, values) - inserts a row of data into a table.
 
 
 
And finally, the meat and potatoes.  The functions that return the statistics we're interested in.
 
 
 
*get_stats(params_object, column_name_list, options_hash) - returns a dbi object containing the stats we are interested.  i.e. the params_object's date range and eprints sets, and only the columns in column_list.  The options hash can contain the following key/value pairs
 
**order => column_name - the column on which to order it.  append with '-' or ' DESC' to order it descending.
 
 
**limit => int - How many results to return
 
**limit => int - How many results to return
 
**group_by => column_name - if we need to group by a column.
 
**group_by => column_name - if we need to group by a column.
get_stats works by examining the 'eprints' parameter and calling one of the following functions:
+
**where => where_hash_array - if additional logic needs to be applied, this array contains hashes containing a column name, an operator and a value.  These are ANDed together.
**get_list_stats
+
*check_tables() - If any IRStats tables are missing, this function will create them.
**get_top_stats
+
*insert_main_table_row(column_array) - inserts the values in the array into the main table (taking into account any tables that contain only IDs).
**get_set_stats
+
*do_sql(sql_query_string) - takes a string and performs a query, returning the dbi object containing the results.  This is the only point where sql is sent to the database.
**get_all_stats
 
These functions generate slightly different mysql queries, and pass them to the do_sql function.
 
  
 
== Date ==
 
== Date ==
I implemented a date object because there were some specific things I needed to do with dates.
+
A date object was implemented because there were some specific things that needed to be done with dates.
  
 
===Functions===
 
===Functions===
Line 104: Line 136:
 
*increment(period) - increments the date by calling mod_date.
 
*part(part_name, style) - Returns the day, month or year.  For month, if style=='text', returns a three letter string, otherwise returns an integer.  For year, if style=='short', returns the last two digits, otherwise returns all four.
 +
*difference(date_object) - returns the difference in days between itself and another date.
 
*less_than(date_object) - compares itself to another date object.  Returns 1 if it's less than it, otherwise returns 0.
 
*less_than(date_object) - compares itself to another date object.  Returns 1 if it's less than it, otherwise returns 0.
 
*greater_than(date_object) - compares itself to another date object.  Returns 1 if it's greater than it, otherwise returns 0.
 
*greater_than(date_object) - compares itself to another date object.  Returns 1 if it's greater than it, otherwise returns 0.
Line 109: Line 142:
 
*render(format_string) - returns a date string.  Format can be:
 
*render(format_string) - returns a date string.  Format can be:
 
**'short' - Calls render_abbreviated - returns a date like this: 05-Jul-77
 
**'short' - Calls render_abbreviated - returns a date like this: 05-Jul-77
**'long' - Calls render_full (not implemented).
 
 
**'numerical' (default) - Calls render_numerical - returns a date like this: 19770705
 
**'numerical' (default) - Calls render_numerical - returns a date like this: 19770705
 
*clone - returns a new, identical date object.
 
  
 
== Cache ==
 
== Cache ==
 
The interface to the cache.
 
The interface to the cache.
 
=== Configuration Constants ===
 
*$cache_directory - a string containing a path to the directory in which the cache files are located.
 
  
 
=== Functions ===
 
=== Functions ===
Line 131: Line 159:
 
*new(start_date_obj, end_date_obj) - doesn't do anything, just returns the object.
 
*new(start_date_obj, end_date_obj) - doesn't do anything, just returns the object.
  
The following functions all return an array of hashes.  Each hash has the keys 'start_date' and 'end_date', and the values are both EPStats::Date objects.
+
The following functions all return an array of hashes.  Each hash has the keys 'start_date' and 'end_date', and the values are both IRStats::Date objects.
  
 
*calandar_months - Returns full months (each element starts on the 1st, and ends on the last day).
 
*calandar_months - Returns full months (each element starts on the 1st, and ends on the last day).
Line 139: Line 167:
  
 
== UserInterface::Controls ==
 
== UserInterface::Controls ==
This is used to generate the drop boxes in the stats cgi script. If I had more time I'd document it fully, but my daughter's going to be born in less than 12 hours.
+
This is used to generate the drop boxes in the stats cgi script.
 
+
===Functions===
== Page (depricated) ==
+
new(params_obj, database_interface_object) - returns the object.
Harkens back to the day when a page object contained views.
+
start_date_control() - returns the html for the three drop-boxes for selecting the year, month and day of the start date.
 +
end_date_control() - return the html for the three drop-boxes for selecting the year, month and day of the end date.
 +
eprint_control() - returns the html for the eprints text box.
 +
drop_box(id, contents_array) - returns the html for a drop box containing what is in the array (each array element is a hash containing 'value' and 'display').
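
As a rough illustration of the drop_box description above, here is a minimal sketch of turning an array of 'value'/'display' hashes into a drop box.  It is not the IRStats implementation: the exact markup, the use of the id as the select name, and the escape_html helper are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html
{
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Sketch of drop_box(id, contents_array): each array element is a hash
# with 'value' and 'display' keys, as described above.
# Assumption: the id is used as the name of the select element.
sub drop_box
{
    my ($id, $contents) = @_;
    my $html = '<select name="' . escape_html($id) . '">';
    foreach my $item (@{$contents})
    {
        $html .= '<option value="' . escape_html($item->{value}) . '">'
               . escape_html($item->{display}) . '</option>';
    }
    return $html . '</select>';
}

print drop_box('start_month', [
    { value => 1, display => 'Jan' },
    { value => 2, display => 'Feb' },
]), "\n";
```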
  
 
== View ==
 
== View ==
A view processes the stats data filtered by the parameters and creates a visualisation.
+
A view processes the stats data filtered by the parameters and creates a visualisation.  It is intended that savvy users create their own views.
  
 
=== Functions ===
 
=== Functions ===
Line 154: Line 185:
 
*new - passes arguments to superclass, then calls 'initialise'.
 
*new - passes arguments to superclass, then calls 'initialise'.
 
*initialise - the Configuration Constants are set here.
 
*initialise - the Configuration Constants are set here.
*populate - The engine that powers EPStats.
+
*populate - The engine that powers IRStats.
  
=== View::FullTextCountHTML ===
+
=== View::DownloadCountHTML ===
The FullTextCountHTML is an extremely simple view.  It retrieves one row from the database and does no processing.
+
The DownloadCountHTML is an extremely simple view.  It retrieves one row from the database and does no processing, making it ideal for a quick walkthrough:
  
 
==== Housekeeping ====
 
==== Housekeeping ====
 
At the top of the file, we need:
 
At the top of the file, we need:
  package EPStats::View::FullTextCountHTML;
+
  package IRStats::View::DownloadCountHTML;
 
  use strict;
 
  use strict;
 
  use warnings;
 
  use warnings;
 
Now, which modules will we use?  I've included perlchartdir, the graph making package, even though we're not using it.
  use EPStats::DatabaseInterface;
+
  use IRStats::DatabaseInterface;
  use EPStats::Cache;
+
  use IRStats::Cache;
  use EPStats::Visualisation::HTML;
+
  use IRStats::Visualisation::HTML;
  use EPStats::View;
+
  use IRStats::View;
 
  use perlchartdir;
 
  use perlchartdir;
 
And link to superclass.  
 
And link to superclass.  
  our @ISA = qw/ EPStats::View /;
+
  our @ISA = qw/ IRStats::View /;
  
 
==== Configuration Constants ====
 
==== Configuration Constants ====
We are interested in retreiving the fulltxt column, and a count as we will be aggregating.  The sql_params are set, so that we can filter on fulltext downloads, and we need to group as we are counting. We also create our visualisation here.
+
We aren't actually interested in any columns, just in the count, but we put that in the columns array anyway.
 +
We also create our visualisation here.
 
  sub initialise
 
  sub initialise
 
  {
 
  {
 
         my ($self) = @_;
 
         my ($self) = @_;
        $self->{'sql_columns'} = [ 'fulltxt', 'COUNT(fulltxt)' ];
+
         $self->{'sql_params'} = {columns => [ 'COUNT' ]};
         $self->{'sql_params'} = {where => "fulltxt = 'F'", group_by => 'fulltxt'};
+
         $self->{'visualisation'} = IRStats::Visualisation::HTML->new();
         $self->{'visualisation'} = EPStats::Visualisation::HTML->new();
 
 
  }
 
  }
  
Line 194: Line 225:
  
 
==== populate ====
 
==== populate ====
Every populate function should start by checking the cache.
+
Almost every populate function should start by checking the cache.
  
 
  sub populate
 
  sub populate
 
  {
 
  {
 
     my ($self) = @_;
 
     my ($self) = @_;
     my $cache = EPStats::Cache->new($self->{'params'}->get('id'));
+
     my $cache = IRStats::Cache->new($self->{'params'}->get('id'));
 
     if ($cache->exists)
 
     if ($cache->exists)
 
     {
 
     {
Line 209: Line 240:
 
     my $query = $self->{'database'}->get_stats(
 
     my $query = $self->{'database'}->get_stats(
 
             $self->{'params'},
 
             $self->{'params'},
            $self->{'sql_columns'},
 
 
             $self->{'sql_params'}
 
             $self->{'sql_params'}
 
             );
 
             );
 
Now we process them.  In this case, we don't even need a loop as we know there's only going to be one row.  We'll stick the result straight into some html, and save it.  Don't forget that if there isn't any data, you still have to output something.
 
Now we process them.  In this case, we don't even need a loop as we know there's only going to be one row.  We'll stick the result straight into some html, and save it.  Don't forget that if there isn't any data, you still have to output something.
 
     my @row = $query->fetchrow_array();
 
     my @row = $query->fetchrow_array();
     my $html = '<span class="epstats_view_fulltextcounthtml">' . ($row[1] ? $row[1] : '0') . "</span>";
+
     my $html = '<span class="irstats_view_fulltextcounthtml">' . ($row[1] ? $row[1] : '0') . "</span>";
 
A little housekeeping:
 
A little housekeeping:
 
     $query->finish();
 
     $query->finish();
Line 222: Line 252:
 
     $cache->write($self->{'visualisation'});
 
     $cache->write($self->{'visualisation'});
 
  }
 
  }
 +
 
And that's a really simple view.
 
And that's a really simple view.
 +
 +
=== Using Periods ===
 +
 +
If we wanted to break our date range into periods, we'd need to do something like this:
 +
 +
my $periods = IRStats::Periods->new($self->{'params'}->{'start_date'},$self->{'params'}->{'end_date'});
 +
foreach my $period ( @{$periods->calandar_months()} )
 +
{
 +
    $self->{'params'}->mask($period);
 +
    my $query = $self->{'database'}->get_stats(
 +
            $self->{'params'},
 +
            $self->{'sql_params'}
 +
    );
 +
    $self->{'params'}->unmask();
 +
    #process and put into variables
 +
}
  
 
== Visualisation ==
 
== Visualisation ==
Line 266: Line 313:
 
An HTML table that is rendered in several columns.
 
An HTML table that is rendered in several columns.
 
==== Configuration Constants ====
 
==== Configuration Constants ====
$default_number_of_rows - an int representing the maximum number of rows the table should have.
+
$default_number_of_rows - an int representing the maximum number of rows the table should have.  This is to prevent sending huge tables to browsers which may not be able to handle it.
  
 
==== Overridden Functions ====
 
==== Overridden Functions ====

Latest revision as of 16:43, 8 August 2019

This is IRStats 1 documentation. IRStats 1 is now out of support. You may have been looking for IRStats2

Directory Structure

/opt/irstats/bin

Contains the scripts needed to update the table.

  • daily_update.sh - Runs all the scripts in the right order.
  • extract_metadata_from_archive.pl - Extracts eprint, author and group metadata from the repository by iterating over every eprint.
  • update_table.pl - Filters and processes new entries in the accesslog to update the irstats_true_accesses_table. Uses 'SearchParser.pm' and 'repeatscache'.
  • convert_ip_to_host.pl - Attempts to convert the IP addresses of new entries in irstats_true_accesses_table to hostnames. Uses 'host_updated' to keep track of where it got to last time.

Note that most of these scripts probably need to be tidied up. They were written in a hurry and were never polished.

/opt/irstats/cache

Contains cache files. These should probably be deleted whenever the database is updated.

/opt/irstats/cgi

Contains two scripts, 'get_view' and 'stats'.

  • get_view returns the output of an IRStats::View (see below), which is currently a chunk of html or csv, but could be almost anything.
  • stats is a handy cgi form that passes arguments to get_view

/opt/irstats/img

Conceptually, where any images would be kept (e.g. national flags). At the moment, only the img/graphs directory is used. This is where generated graphs are stored.

/opt/irstats/cfg

Where the configuration file and the text files containing repository data are held.

The Configuration File

irstats.cfg contains a number of configuration strings. Here are some of the more important ones, with the default in brackets:

  • configuration_path (/opt/irstats/cfg/) - The path of the configuration directory.
  • view_path (/opt/irstats/perl_lib/IRStats/View/) - The directory containing the Views.
  • cache_path (/opt/irstats/cache/) - The directory in which to store cache files.
  • graph_path (/opt/irstats/img/graphs/) - The directory in which to store graph images.
  • graph_relative_url_path (/img/graphs/) - The URL path of the graph directory as seen by the web browser.
  • update_lock_filename (/opt/irstats/bin/.lock) - The name of the file that is created to prevent the update process running twice concurrently
  • The names of the files used to store set information
    • set_member_full_citations_file (/opt/irstats/cfg/irstats_set_member_full_citations.txt)
    • set_member_short_citations_file (/opt/irstats/cfg/irstats_set_member_short_citations.txt)
    • set_membership_file (/opt/irstats/cfg/irstats_set_membership.txt)
    • set_member_codes_file (/opt/irstats/cfg/irstats_set_member_codes.txt)
    • set_member_urls_file (/opt/irstats/cfg/irstats_set_member_urls.txt)
  • Referrer Scope Labels (note, if you change these, you should also change them in the database)
    • referrer_scope_1 (Internal)
    • referrer_scope_2 (ECS)
    • referrer_scope_3 (Search)
    • referrer_scope_4 (External)
    • referrer_scope_no_referrer (None)
  • awstats_search_engines (/usr/local/awstats/wwwroot/cgi-bin/lib/search_engines.pm) - The path to the awstats search engine module
  • repeats_filter_file (/opt/irstats/bin/repeatscache) - The file to maintain state between updates
  • repeats_filter_timeout (86400) - repeat timeout in seconds (the minimum time that must pass between two hits for both of them to be recorded; the default is 60*60*24)
  • database configuration
    • database_driver (mysql)
    • database_server (localhost)
    • database_name
    • database_user
    • database_password
  • database_id_columns ([ requester_organisation, requester_host, referrer_scope, search_engine, search_terms, referring_entity_id ]) - The columns in the database that hold a UID rather than data. These need separate tables in which to store the data.
  • Various table names and parts of names
    • database_eprints_access_log_table (accesslog) ##Perhaps remove after update rewrite.
    • database_main_stats_table (irstats_true_accesses_table)
    • database_column_table_prefix (irstats_column_)
    • database_set_table_prefix (irstats_set_)
    • database_set_table_code_suffix (_code)
    • database_set_table_citation_suffix (_citation)
  • id_parameters ([ start_date, end_date, eprints, view ]) - the parameters that are used to uniquely identify a view
  • host_lookup_temp_dir (/opt/irstats/bin/convert_hosts_temp_files/) - The directory in which to store temp files for host lookups
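
The repeats_filter_timeout setting can be pictured with a minimal sketch. It assumes that a "repeat" means the same requester hitting the same eprint again within the timeout, and that the timeout is measured from the last recorded hit; neither detail is stated above. The function name is_repeat and the in-memory hash (standing in for the repeats_filter_file) are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the state kept in repeats_filter_file.
my %last_recorded;

# Default taken from the configuration list above (60*60*24 seconds).
my $repeats_filter_timeout = 86400;

# Returns 1 if the hit should be dropped as a repeat, 0 if it should be recorded.
# Assumption: a hit is identified by requester IP plus eprint ID.
sub is_repeat
{
    my ($ip, $eprint_id, $timestamp) = @_;
    my $key = "$ip|$eprint_id";
    my $previous = $last_recorded{$key};
    if (defined $previous and $timestamp - $previous < $repeats_filter_timeout)
    {
        return 1;
    }
    # Remember when this requester/eprint pair was last recorded.
    $last_recorded{$key} = $timestamp;
    return 0;
}

print is_repeat('192.0.2.1', 12614, 1000), "\n";          # first hit
print is_repeat('192.0.2.1', 12614, 1000 + 3600), "\n";   # one hour later
print is_repeat('192.0.2.1', 12614, 1000 + 86400), "\n";  # one day later
```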


/opt/irstats/perl_lib

Contains all the irstats classes.

IRStats Classes

Note that the leading IRStats:: has been left out for brevity.

Configuration

This object acts as an interface to the configuration file.

Configuration Constants

  • $configuration_file - The path to the configuration file.

Functions

  • new - Parses the configuration file and returns a new object.
  • get_value(config_id) - Returns a value.
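
To make the new/get_value pair concrete, here is a minimal sketch of a configuration object. It assumes the file holds simple "key = value" lines with '#' comments, which is a guess at the format, and it takes a filehandle as an argument so that the example is self-contained, whereas the real new finds the file through $configuration_file. The package name Sketch::Configuration is hypothetical.

```perl
use strict;
use warnings;

package Sketch::Configuration;

# Parses "key = value" lines from a filehandle.
# Blank lines and lines starting with '#' are skipped (assumed format).
sub new
{
    my ($class, $fh) = @_;
    my $self = bless { values => {} }, $class;
    while (my $line = <$fh>)
    {
        chomp $line;
        next if $line =~ /^\s*(#|$)/;
        if ($line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/)
        {
            $self->{values}->{$1} = $2;
        }
    }
    return $self;
}

# Returns the value stored for a configuration ID, or undef if it is not set.
sub get_value
{
    my ($self, $config_id) = @_;
    return $self->{values}->{$config_id};
}

package main;

# An in-memory file keeps the example independent of the real irstats.cfg.
my $text = "# sample\ncache_path = /opt/irstats/cache/\nrepeats_filter_timeout = 86400\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
my $conf = Sketch::Configuration->new($fh);
close($fh);

print $conf->get_value('cache_path'), "\n";
```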

Params

This object holds the parameters that are used to generate the statistics. This is passed around the system.

Configuration Constants

  • $defaults - Any default parameters you wish to set.

Functions

  • new(Configuration, [ CGI_object | params_hash ]) - returns new object
  • mask(params_hash) - used when you want to temporarily overwrite parameter(s). Overwrites values with contents of params_hash. Overwritten values get pushed onto a stack.
  • unmask - Sets parameters back to how they were before the last mask.
  • get(param_name) - returns the value of a single parameter.
  • create_id - Uses MD5 to create a unique ID from the id_parameters (see The Configuration File above). This is called whenever get('id') is called.
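
The mask/unmask stack and the MD5-based ID can be sketched as follows. This is an illustration rather than the real Params class: the constructor here takes the list of ID parameters and a plain hash directly instead of a Configuration object, dates are plain strings rather than Date objects, and the way values are joined before hashing is an assumption. The package name Sketch::Params and the parameter values are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 ();

package Sketch::Params;

sub new
{
    my ($class, $id_parameters, $params_hash) = @_;
    return bless {
        id_parameters => $id_parameters,
        params        => { %{$params_hash} },
        mask_stack    => [],
    }, $class;
}

# Overwrite some parameters, pushing the old values onto a stack.
sub mask
{
    my ($self, $params_hash) = @_;
    my %saved;
    foreach my $name (keys %{$params_hash})
    {
        $saved{$name} = $self->{params}->{$name};
        $self->{params}->{$name} = $params_hash->{$name};
    }
    push @{$self->{mask_stack}}, \%saved;
}

# Undo the most recent mask by restoring the values it overwrote.
sub unmask
{
    my ($self) = @_;
    my $saved = pop @{$self->{mask_stack}} or return;
    $self->{params}->{$_} = $saved->{$_} foreach keys %{$saved};
}

# Hash the ID parameters (joined with '|', an assumed separator) with MD5.
sub create_id
{
    my ($self) = @_;
    my @parts = map { defined $self->{params}->{$_} ? $self->{params}->{$_} : '' }
                @{$self->{id_parameters}};
    return Digest::MD5::md5_hex(join('|', @parts));
}

sub get
{
    my ($self, $name) = @_;
    return $self->create_id() if $name eq 'id';
    return $self->{params}->{$name};
}

package main;

my $params = Sketch::Params->new(
    [ 'start_date', 'end_date', 'eprints', 'view' ],
    { start_date => '20070101', end_date => '20071231', eprints => 'all', view => 'DownloadCountHTML' },
);
my $id_before = $params->get('id');
$params->mask({ start_date => '20070301', end_date => '20070331' });
my $id_masked = $params->get('id');
$params->unmask();
print "restored\n" if $params->get('id') eq $id_before;
```

Because the ID is recomputed on every get('id'), a masked date range produces a different ID from the unmasked one.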

DatabaseInterface

This object does what it says on the tin. Any access to the database is done through it.

Functions

  • new(Configuration) - returns object.
  • retreive_set_names() - returns a list of eprint sets. This can be used to verify cgi input.
  • get_membership(eprint_id, set_name) - For a given eprint ID, returns the members of the named set that it belongs to. For example, we can find out which authors eprint 12614 has by get_membership(12614, 'author').
  • get_citation(id, set, length) - returns a citation. Every set member (eprint, author, group) has two citations, short and full. The short citation is returned only if length == 'short'; otherwise the full one is returned. So, to get the short citation of group 3: get_citation(3,'group','short').
  • get_stats(params_object, query_params_hash) - returns a dbi object containing the stats we are interested in, i.e. those matching the params_object's date range and eprint sets, restricted to the columns named in the query params hash. The query params hash can contain the following key/value pairs:
    • columns => column_name_array - Which columns are we interested in?
    • order => column_name - A hash containing a column name and a direction (ASC or DESC)
    • limit => int - How many results to return
    • group_by => column_name - if we need to group by a column.
    • where => where_hash_array - if additional logic needs to be applied, this array contains hashes containing a column name, an operator and a value. These are ANDed together.
  • check_tables() - If any IRStats tables are missing, this function will create them.
  • insert_main_table_row(column_array) - inserts the values in the array into the main table (taking into account any tables that contain only IDs).
  • do_sql(sql_query_string) - takes a string and performs a query, returning the dbi object containing the results. This is the only point where sql is sent to the database.
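
As a rough picture of how a query params hash could map onto SQL, here is a minimal sketch. It is not the real get_stats: it ignores the date range, the eprint sets and the ID-column tables entirely, and the hash key names used for 'order' ('column', 'direction') and 'where' ('column', 'operator', 'value') are assumptions based on the descriptions above. The function name build_query is hypothetical.

```perl
use strict;
use warnings;

# Builds an SQL string plus a list of bind values from a query params hash.
sub build_query
{
    my ($table, $query_params) = @_;
    my @bind;
    my $sql = 'SELECT ' . join(', ', @{$query_params->{columns}}) . " FROM $table";

    if ($query_params->{where} and @{$query_params->{where}})
    {
        my @clauses;
        foreach my $condition (@{$query_params->{where}})
        {
            # Values go in as placeholders; the conditions are ANDed together.
            push @clauses, "$condition->{column} $condition->{operator} ?";
            push @bind, $condition->{value};
        }
        $sql .= ' WHERE ' . join(' AND ', @clauses);
    }
    $sql .= " GROUP BY $query_params->{group_by}" if $query_params->{group_by};
    if (my $order = $query_params->{order})
    {
        $sql .= " ORDER BY $order->{column} $order->{direction}";
    }
    $sql .= ' LIMIT ' . int($query_params->{limit}) if $query_params->{limit};
    return ($sql, @bind);
}

my ($sql, @bind) = build_query('irstats_true_accesses_table', {
    columns  => [ 'search_engine', 'COUNT(*)' ],
    where    => [ { column => 'referrer_scope', operator => '=', value => 'Search' } ],
    group_by => 'search_engine',
    order    => { column => 'COUNT(*)', direction => 'DESC' },
    limit    => 10,
});
print "$sql\n";
```

The string and bind values returned here are the kind of thing that would end up being handed to the single point of database access, do_sql.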

Date

A date object was implemented because there were some specific things that needed to be done with dates.

Functions

  • new(date_hash) - Creates a new date object when passed a hash with the keys 'day', 'month' and 'year'.
  • validate() - If the date is not valid, it will be modified to a sensible value, e.g. if it's Feb 30th, it will be modified to Feb 29th or 28th, depending on whether it's a leap year.
  • set(part_name, int) - Sets part of the date ('year','month' or 'day') to a specific value.
  • decrement(period) - decrements the date by a period ('day', 'week', 'month', 'quarter', 'year'). Calls the mod_date function, which does the muscle work.
  • increment(period) - increments the date by a period, also by calling mod_date.
  • part(part_name, style) - Returns the day, month or year. For month, if style=='text', returns a three letter string, otherwise returns an integer. For year, if style=='short', returns the last two digits, otherwise returns all four.
  • difference(date_object) - returns the difference in days between itself and another date.
  • less_than(date_object) - compares itself to another date object. Returns 1 if it's less than it, otherwise returns 0.
  • greater_than(date_object) - compares itself to another date object. Returns 1 if it's greater than it, otherwise returns 0.
  • month_name() - returns the three letter string of the month.
  • render(format_string) - returns a date string. Format can be:
    • 'short' - Calls render_abbreviated - returns a date like this: 05-Jul-77
    • 'numerical' (default) - Calls render_numerical - returns a date like this: 19770705
  • clone - returns a new, identical date object.
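
The validate and render behaviour described above could be sketched as follows. This is a stand-alone illustration using plain hashes rather than the IRStats::Date object. The helper names (is_leap_year, days_in_month) are hypothetical, and the sketch assumes the month is already in the range 1 to 12.

```perl
use strict;
use warnings;

my @MONTH_NAMES = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Gregorian leap year rule.
sub is_leap_year {
    my ($year) = @_;
    return (($year % 4 == 0 && $year % 100 != 0) || $year % 400 == 0) ? 1 : 0;
}

sub days_in_month {
    my ($year, $month) = @_;
    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    return 29 if $month == 2 && is_leap_year($year);
    return $days[$month - 1];
}

# Clamp the day to the valid range for its month (assumes a valid month).
sub validate {
    my ($date) = @_;    # hashref with keys day, month, year
    my $max = days_in_month($date->{year}, $date->{month});
    $date->{day} = 1    if $date->{day} < 1;
    $date->{day} = $max if $date->{day} > $max;
    return $date;
}

# 'short' gives DD-Mon-YY; anything else gives the numerical YYYYMMDD form.
sub render {
    my ($date, $format) = @_;
    if (defined $format && $format eq 'short') {
        return sprintf('%02d-%s-%02d',
            $date->{day}, $MONTH_NAMES[ $date->{month} - 1 ], $date->{year} % 100);
    }
    return sprintf('%04d%02d%02d', $date->{year}, $date->{month}, $date->{day});
}

my $date = validate({ day => 30, month => 2, year => 2004 });
print render($date), "\n";
```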

Cache

The interface to the cache.

Functions

  • new(id) - takes the ID of the params object we're using at the moment.
  • exists() - returns true if there's a cached file, false if there isn't one.
  • write(visualisation_object) - writes the data to the cache file.
  • read() - returns the data from the cache.
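
A file-backed cache with this interface might look something like the sketch below. It is an assumption-laden illustration, not the IRStats::Cache code: the package name Sketch::Cache, the extra directory argument to new, the '.cache' file suffix and the use of Storable for serialisation are all choices made for the example.

```perl
package Sketch::Cache;    # hypothetical stand-in, not IRStats::Cache
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Spec;

# The real constructor takes only an ID; the directory is passed in here
# so that the sketch is self-contained.
sub new {
    my ($class, $dir, $id) = @_;
    my $self = { file => File::Spec->catfile($dir, "$id.cache") };
    return bless $self, $class;
}

# True if a cache file already exists for this ID.
sub exists {
    my ($self) = @_;
    return (-e $self->{file}) ? 1 : 0;
}

# Serialise a data structure (must be a reference) to the cache file.
sub write {
    my ($self, $data) = @_;
    store($data, $self->{file});
    return;
}

# Read the data structure back from the cache file.
sub read {
    my ($self) = @_;
    return retrieve($self->{file});
}

package main;
use File::Temp qw(tempdir);

my $dir   = tempdir(CLEANUP => 1);
my $cache = Sketch::Cache->new($dir, 'abc123');
$cache->write({ html => '12' }) unless $cache->exists;
print $cache->read->{html}, "\n";
```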

Periods

The Periods object is used when you want to break a daterange down into sub-ranges. Used with the params->mask() function, stats can be retrieved for periods inside a date range.

Functions

  • new(start_date_obj, end_date_obj) - doesn't do anything, just returns the object.

The following functions all return an array of hashes. Each hash has the keys 'start_date' and 'end_date', and the values are both IRStats::Date objects.

  • calandar_months - Returns full months (each element starts on the 1st, and ends on the last day).
  • months - Returns month periods (if the start_date is the 15th, then each period starts on the 15th and ends on the 14th of the next month - except the last period, which only has about a 1/30 chance of doing so).
  • weeks - returns 7-day periods (except the last, which has a 1/7 chance of being 7 days long).
  • days - returns single days (for each period, the start_date and end_date are the same).
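
To make the shape of the returned data concrete, here is a minimal sketch of how 7-day periods could be computed. It is not the IRStats::Periods code: dates are passed as [year, month, day] arrays and returned as YYYYMMDD strings instead of IRStats::Date objects, and the function is a plain sub rather than a method.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);
use POSIX qw(strftime);

# Split an inclusive date range into 7-day periods. The last period is
# truncated at the end date, so it may be shorter than 7 days.
sub weeks {
    my ($start, $end) = @_;    # each is [year, month, day]
    my $one_day = 24 * 60 * 60;

    my $cur  = timegm(0, 0, 0, $start->[2], $start->[1] - 1, $start->[0]);
    my $last = timegm(0, 0, 0, $end->[2],   $end->[1] - 1,   $end->[0]);

    my @periods;
    while ($cur <= $last) {
        my $period_end = $cur + 6 * $one_day;
        $period_end = $last if $period_end > $last;
        push @periods, {
            start_date => strftime('%Y%m%d', gmtime($cur)),
            end_date   => strftime('%Y%m%d', gmtime($period_end)),
        };
        $cur = $period_end + $one_day;
    }
    return \@periods;
}

for my $period (@{ weeks([2007, 1, 1], [2007, 1, 17]) }) {
    print "$period->{start_date} to $period->{end_date}\n";
}
```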

UserInterface::Controls

This is used to generate the drop boxes in the stats cgi script.

Functions

  • new(params_obj, database_interface_object) - returns the object.
  • start_date_control() - returns the html for the three drop-boxes for selecting the year, month and day of the start date.
  • end_date_control() - returns the html for the three drop-boxes for selecting the year, month and day of the end date.
  • eprint_control() - returns the html for the eprints text box.
  • drop_box(id, contents_array) - returns the html for a drop box containing what is in the array (each array element is a hash containing 'value' and 'display').

View

A view processes the stats data filtered by the parameters and creates a visualisation. It is intended that savvy users create their own views.

Functions

All views inherit:

  • new(params_obj, database_interface_object) - returns the object.
  • render - calls populate, then returns whatever the visualisation renders

All views must implement:

  • new - passes arguments to superclass, then calls 'initialise'.
  • initialise - the Configuration Constants are set here.
  • populate - The engine that powers IRStats.

View::DownloadCountHTML

The DownloadCountHTML is an extremely simple view. It retrieves one row from the database and does no processing, making it ideal for a quick walkthrough:

Housekeeping

At the top of the file, we need:

package IRStats::View::DownloadCountHTML;
use strict;
use warnings;

Now, which modules will we use? I've included perlchartdir, the graph making package, even though we're not using it.

use IRStats::DatabaseInterface;
use IRStats::Cache;
use IRStats::Visualisation::HTML;
use IRStats::View;
use perlchartdir;

And link to superclass.

our @ISA = qw/ IRStats::View /;

Configuration Constants

We aren't actually interested in any columns, just in the count, but we put that in the columns array anyway. We also create our visualisation here.

sub initialise
{
       my ($self) = @_;
       $self->{'sql_params'} = {columns => [ 'COUNT' ]};
       $self->{'visualisation'} = IRStats::Visualisation::HTML->new();
}

new

The new function shouldn't ever need to be any different from this:

sub new
{
       my( $class, $params, $database ) = @_;
       my $self = $class->SUPER::new($params, $database);
       $self->initialise();
       return $self;
}

populate

Almost every populate function should start by checking the cache.

sub populate
{
   my ($self) = @_;
   my $cache = IRStats::Cache->new($self->{'params'}->get('id'));
   if ($cache->exists)
   {
       $self->{'visualisation'} = $cache->read();
       return;
   }

Next, we have to retrieve the stats from the database:

   my $query = $self->{'database'}->get_stats(
           $self->{'params'},
           $self->{'sql_params'}
           );

Now we process the results. In this case, we don't even need a loop as we know there's only going to be one row. We'll stick the result straight into some html, and save it. Don't forget that if there isn't any data, you still have to output something.

   my @row = $query->fetchrow_array();
   my $html = '' . ($row[1] ? $row[1] : '0') . "";

A little housekeeping:

   $query->finish();

Pop the data into the visualisation:

   $self->{'visualisation'}->set('html',$html);

Finally, we should write to the cache so we don't have to query the database next time.

   $cache->write($self->{'visualisation'});
}

And that's a really simple view.

Using Periods

If we wanted to break our daterange into periods, we'd need to do something like this:

my $periods = IRStats::Periods->new($self->{'params'}->{'start_date'},$self->{'params'}->{'end_date'});
foreach my $period ( @{$periods->calandar_months()} )
{
   $self->{'params'}->mask($period);
   my $query = $self->{'database'}->get_stats(
           $self->{'params'},
           $self->{'sql_params'}
   );
   $self->{'params'}->unmask();
   #process and put into variables
}

Visualisation

Currently, Visualisations are Graph, Table or HTML. These are what the user will look at in the browser or download (in the case of CSV).

Functions

All visualisations inherit:

  • new(data_hash) - a hash can optionally be passed containing the values that would otherwise be set using the 'set' function.
  • set(param_name, value) - sets something to something - see subclasses

All visualisations must implement:

  • render() - returns what will be passed to the script.


Visualisation::HTML

The simplest visualisation. Just a chunk of html.

To Populate:

  • set('html', html_string) - takes the html as a string.
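
The relationship between the base class and a subclass can be sketched as follows. The Sketch:: package names are hypothetical stand-ins, and the bodies are assumptions based only on the descriptions above, not the IRStats source.

```perl
package Sketch::Visualisation;    # hypothetical stand-in for the base class
use strict;
use warnings;

# new accepts an optional hash of values that would otherwise be passed
# to 'set' one at a time.
sub new {
    my ($class, $data) = @_;
    my $self = bless {}, $class;
    $self->set($_, $data->{$_}) for keys %{ $data || {} };
    return $self;
}

sub set {
    my ($self, $name, $value) = @_;
    $self->{$name} = $value;
    return $self;
}

# Subclasses are expected to override this.
sub render { die ref(shift) . " must implement render\n" }

package Sketch::Visualisation::HTML;
use strict;
use warnings;
our @ISA = ('Sketch::Visualisation');

# The HTML visualisation simply hands back the stored chunk of html.
sub render {
    my ($self) = @_;
    return defined $self->{html} ? $self->{html} : '';
}

package main;

my $vis = Sketch::Visualisation::HTML->new();
$vis->set('html', '<p>12 downloads</p>');
print $vis->render(), "\n";
```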

Visualisation::Table

The Visualisation::Table currently just passes the buck to its superclass.

There are currently three table Visualisations:

Visualisation::Table::CSV

Returns a CSV table.

To Populate:

  • set('headings', headings_arrayref) - pass an array containing headings.
  • set('rows', rows_arrayref) - an array of arrayrefs, each referencing a row of data.
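
For illustration, turning headings and rows into CSV text could look like the sketch below. The function name render_csv and the quoting rules (double quotes around fields containing commas, quotes or line breaks) are assumptions. The source does not say how Visualisation::Table::CSV escapes its output.

```perl
use strict;
use warnings;

# Hypothetical helper: build CSV text from a headings arrayref and an
# arrayref of row arrayrefs.
sub render_csv {
    my ($headings, $rows) = @_;

    # Quote a field only when it contains a comma, quote or line break.
    my $quote = sub {
        my ($field) = @_;
        $field = '' unless defined $field;
        if ($field =~ /[",\r\n]/) {
            (my $escaped = $field) =~ s/"/""/g;
            return qq{"$escaped"};
        }
        return $field;
    };

    my @lines;
    for my $row ($headings, @$rows) {
        push @lines, join(',', map { $quote->($_) } @$row);
    }
    return join("\n", @lines) . "\n";
}

print render_csv(
    [ 'Eprint', 'Downloads' ],
    [ [ 12614, 40 ], [ 'A "quoted", title', 3 ] ],
);
```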

Visualisation::Table::HTML

A basic HTML table.

To Populate:

  • set('columns', headings_arrayref) - pass an array containing column headings.
  • set('rows', rows_arrayref) - an array of arrayrefs, each referencing a row of data.

And then optionally

  • set('totals', totals_arrayref) - an array of totals to put at the bottom of the table.

Visualisation::Table::HTML_Columned

An HTML table that is rendered in several columns.

Configuration Constants

$default_number_of_rows - an int representing the maximum number of rows the table should have. This is to prevent sending huge tables to browsers which may not be able to handle them.

Overridden Functions

  • new(data_hash, number_of_rows) - Both data_hash and number_of_rows are optional. Both can be set with 'set'.

To Populate:

  • set('columns', headings_arrayref) - pass an array containing column headings.
  • set('rows', rows_arrayref) - an array of arrayrefs, each referencing a row of data.
  • set('number_of_rows', int) - set the maximum number of rows the table should have.

Visualisation::Graph

The graph objects all use Chart Director to generate graphs. The Graph object initialises the colours that the graph may be using.

Every graph must be created with at least the filename:

  • new({filename => string}) - the filename comes from the ID of the param object.

Configuration Constants

These are set in the 'new' function.

  • $graph_dir - the path to the directory where the image file will be saved.
  • $url_relative - this will have the filename added to the end and put in the img html tag.

Sub Classes

Note that in the Visualisation/Graph/ directory, there is 'GraphLegend.pm'. This is used to create the html for the graph legends.

Visualisation::Graph::Bar.pm

A Bar Graph. It can have one or more bars in each division of the x axis.

To implement:

  • set('title',string) - The title that will be in the graph image.
  • set('x_title',string) - The title of the x axis.
  • set('y_title',string) - The title of the y axis.
  • set('x_labels',array_ref) - an array containing the labels for the x axis
  • set('data_series', array_ref) - an array of arrayrefs, referencing data for each set of bars

Visualisation::Graph::Line.pm

A Line Graph. There can be many lines on it.

To implement:

  • set('title',string) - The title that will be in the graph image.
  • set('x_title',string) - The title of the x axis.
  • set('y_title',string) - The title of the y axis.
  • set('x_labels',array_ref) - an array containing the labels for the x axis
  • set('data_series', array_ref) - an array of arrayrefs, referencing data for each line

Visualisation::Graph::Pie.pm

A Pie Graph

To implement:

  • set('title',string) - The title that will be in the graph image.
  • set('data_series', array_ref) - an array of hashrefs, {data => int, citation => string}, one for each slice
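
As an example of the data shape the pie graph expects, the sketch below converts simple [citation, count] rows into the array of hashrefs described above. The helper name rows_to_pie_series and the input row layout are assumptions for the example. Only the {data, citation} output shape comes from the description.

```perl
use strict;
use warnings;

# Hypothetical helper: convert rows of [citation, count] into the
# {data => int, citation => string} hashrefs used for pie slices.
sub rows_to_pie_series {
    my ($rows) = @_;
    return [ map { +{ data => $_->[1], citation => $_->[0] } } @$rows ];
}

my $series = rows_to_pie_series([
    [ 'Smith, J.', 40 ],
    [ 'Jones, A.', 12 ],
]);

# The result would then be handed to the visualisation, e.g.
# $graph->set('data_series', $series);
print scalar(@$series), " slices\n";
```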