IRStats Technical Documentation

From Eprints Documentation

Directory Structure

/opt/irstats/bin

Contains the scripts needed to update the table.

  • daily_update.sh - Runs all the scripts in the right order.
  • extract_metadata_from_archive.pl - Extracts eprint, author and group metadata from the repository by iterating over every eprint.
  • update_table.pl - Filters and processes new entries in the accesslog to update the irstats_true_accesses_table. Uses 'SearchParser.pm' and 'repeatscache'.
  • convert_ip_to_host.pl - Attempts to convert the IP addresses of the new entries in irstats_true_accesses_table to hostnames. Uses 'host_updated' to keep track of where it got to last time.

Note that most of these scripts probably need to be tidied up. They were written in a hurry and were never polished.

/opt/irstats/cache

Contains cache files. These should probably be deleted whenever the database is updated.

/opt/irstats/cgi

Contains two scripts, 'get_view' and 'stats'.

  • get_view returns the output of an IRStats::View (see below), which is currently a chunk of html or csv, but could be almost anything.
  • stats is a handy cgi form that passes arguments to get_view.

/opt/irstats/img

Conceptually, where any images would be kept (e.g. national flags). At the moment, only the img/graphs directory is used. This is where generated graphs are stored.

/opt/irstats/cfg

Where the configuration file and the text files containing repository data are held.

The Configuration File

irstats.cfg contains a number of configuration strings. Here are some of the more important ones, with the default in brackets:

  • configuration_path (/opt/irstats/cfg/) - The path of the configuration directory.
  • view_path (/opt/irstats/perl_lib/IRStats/View/) - The directory containing the Views.
  • cache_path (/opt/irstats/cache/) - The directory in which to store cache files.
  • graph_path (/opt/irstats/img/graphs/) - The directory in which to store graph images.
  • graph_relative_url_path (/img/graphs/) - The URL of the graph directory as seen by the web browser.
  • update_lock_filename (/opt/irstats/bin/.lock) - The name of the file that is created to prevent the update process running twice concurrently.
  • The names of the files used to store set information
    • set_member_full_citations_file (/opt/irstats/cfg/irstats_set_member_full_citations.txt)
    • set_member_short_citations_file (/opt/irstats/cfg/irstats_set_member_short_citations.txt)
    • set_membership_file (/opt/irstats/cfg/irstats_set_membership.txt)
    • set_member_codes_file (/opt/irstats/cfg/irstats_set_member_codes.txt)
    • set_member_urls_file (/opt/irstats/cfg/irstats_set_member_urls.txt)
  • Referrer Scope Labels (note, if you change these, you should also change them in the database)
    • referrer_scope_1 (Internal)
    • referrer_scope_2 (ECS)
    • referrer_scope_3 (Search)
    • referrer_scope_4 (External)
    • referrer_scope_no_referrer (None)
  • awstats_search_engines (/usr/local/awstats/wwwroot/cgi-bin/lib/search_engines.pm) - The path to the awstats search engine module
  • repeats_filter_file (/opt/irstats/bin/repeatscache) - The file to maintain state between updates
  • repeats_filter_timeout (86400) - repeat timeout in seconds (the amount of time there needs to be between two hits for them both to be recorded, initially set to 60*60*24)
  • database configuration
    • database_driver (mysql)
    • database_server (localhost)
    • database_name
    • database_user
    • database_password
  • database_id_columns ([ requester_organisation, requester_host, referrer_scope, search_engine, search_terms, referring_entity_id ]) - The columns in the database that have a UID rather than data. These need separate tables in which to store the data.
  • Various table names and parts of names
    • database_eprints_access_log_table (accesslog) ##Perhaps remove after update rewrite.
    • database_main_stats_table (irstats_true_accesses_table)
    • database_column_table_prefix (irstats_column_)
    • database_set_table_prefix (irstats_set_)
    • database_set_table_code_suffix (_code)
    • database_set_table_citation_suffix (_citation)
  • id_parameters ([ start_date, end_date, eprints, view ]) - the parameters that are used to uniquely identify a view
  • host_lookup_temp_dir (/opt/irstats/bin/convert_hosts_temp_files/) - The directory in which to store temp files for host lookups
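
The effect of repeats_filter_timeout can be sketched as follows. This is only an illustration: the should_record helper, the key built from requester IP plus eprint ID, and the choice to measure from the last recorded hit are assumptions, not taken from update_table.pl.

```perl
use strict;
use warnings;

# Default value of repeats_filter_timeout (60*60*24 seconds)
my $REPEATS_FILTER_TIMEOUT = 86400;

# Hypothetical helper: decides whether a hit should be recorded.
# $last_seen maps "ip|eprint_id" to the timestamp of the last recorded hit.
sub should_record {
    my ($last_seen, $ip, $eprint_id, $timestamp) = @_;
    my $key  = "$ip|$eprint_id";
    my $prev = $last_seen->{$key};

    # Skip the hit if the same requester fetched the same eprint too recently.
    return 0 if defined $prev && $timestamp - $prev < $REPEATS_FILTER_TIMEOUT;

    $last_seen->{$key} = $timestamp;
    return 1;
}

my %last_seen;
for my $hit ([ '10.0.0.1', 12614, 0 ], [ '10.0.0.1', 12614, 3600 ], [ '10.0.0.1', 12614, 90000 ]) {
    my ($ip, $eprint_id, $time) = @$hit;
    printf "%s eprint %d at %d: %s\n", $ip, $eprint_id, $time,
        should_record(\%last_seen, $ip, $eprint_id, $time) ? 'recorded' : 'skipped';
}
```

In this sketch the second hit falls inside the timeout and is skipped, while the third arrives more than a day after the first recorded hit.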


/opt/irstats/perl_lib

Contains all the irstats classes.

IRStats Classes

Note that the leading IRStats:: has been left out for brevity.

Configuration

This object acts as an interface to the configuration file.

Configuration Constants

  • $configuration_file - The path to the configuration file.

Functions

  • new - Parses the configuration file and returns a new object.
  • get_value(config_id) - Returns a value.

Params

This object holds the parameters that are used to generate the statistics. This is passed around the system.

Configuration Constants

  • $defaults - Any default parameters you wish to set.

Functions

  • new(Configuration, [ CGI_object | params_hash ]) - returns new object
  • mask(params_hash) - used when you want to temporarily overwrite parameter(s). Overwrites values with contents of params_hash. Overwritten values get pushed onto a stack.
  • unmask - Sets parameters back to how they were before the last mask.
  • get(param_name) - returns the value of a single parameter.
  • create_id - Uses MD5 to create a unique ID from the parameters named in id_parameters (see The Configuration File above). This is called whenever get('id') is called.
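
The mask/unmask stack and the MD5-based ID can be sketched as below. Sketch::Params is a hypothetical stand-in, not the real IRStats::Params: it takes a plain hash rather than a Configuration and CGI object, stores dates as strings, and joins the id_parameters with '|' before hashing, all of which are assumptions.

```perl
use strict;
use warnings;

package Sketch::Params;    # hypothetical stand-in for IRStats::Params

use Digest::MD5 qw(md5_hex);

# The id_parameters listed in the configuration file section
my @ID_PARAMETERS = qw(start_date end_date eprints view);

sub new {
    my ($class, %params) = @_;
    return bless { params => {%params}, stack => [] }, $class;
}

# Overwrite parameters, pushing the replaced values onto a stack
sub mask {
    my ($self, $overrides) = @_;
    my %saved = map { $_ => $self->{params}{$_} } keys %$overrides;
    push @{ $self->{stack} }, \%saved;
    $self->{params}{$_} = $overrides->{$_} for keys %$overrides;
    return;
}

# Restore the values saved by the most recent mask
sub unmask {
    my ($self) = @_;
    my $saved = pop @{ $self->{stack} } or return;
    $self->{params}{$_} = $saved->{$_} for keys %$saved;
    return;
}

sub create_id {
    my ($self) = @_;
    return md5_hex(join '|', map { $self->{params}{$_} // '' } @ID_PARAMETERS);
}

sub get {
    my ($self, $name) = @_;
    return $name eq 'id' ? $self->create_id : $self->{params}{$name};
}

package main;

my $params = Sketch::Params->new(
    start_date => '20070101',
    end_date   => '20071231',
    eprints    => 'eprint_12614',
    view       => 'DownloadCountHTML',
);
my $full_range_id = $params->get('id');

$params->mask({ end_date => '20070131' });    # temporarily narrow the range
my $january_id = $params->get('id');          # differs, so it caches separately
$params->unmask();                            # back to the full range

print "full range: $full_range_id\njanuary:    $january_id\n";
```

Because the ID is derived from the current parameter values, a masked params object yields a different ID from the unmasked one.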

DatabaseInterface

This object does what it says on the tin. Any access to the database is done through it.

Functions

  • new(Configuration) - returns object.
  • retreive_set_names() - returns a list of eprint sets. This can be used to verify cgi input.
  • get_membership(eprint_id, set_name) - For a given eprint ID, returns the members of the named set that it belongs to. For example, we can find out which authors eprint 12614 has by get_membership(12614, 'author').
  • get_citation(id, set, length) - returns a citation. Every set member (eprint, author, group) has two citations, short and full. The short citation is returned only if length == 'short'; otherwise the full citation is returned. So, to get the short citation of group 3: get_citation(3,'group','short').
  • get_stats(params_object, query_params_hash) - returns a dbi object containing the stats we are interested in, i.e. those within the params_object's date range and eprint sets, restricted to the columns in the query params hash. The query params hash can contain the following key/value pairs:
    • columns => column_name_array - Which columns are we interested in?
    • order => order_hash - A hash containing a column name and a direction (ASC or DESC).
    • limit => int - How many results to return
    • group_by => column_name - if we need to group by a column.
    • where => where_hash_array - if additional logic needs to be applied, this array contains hashes containing a column name, an operator and a value. These are ANDed together.
  • check_tables() - If any IRStats tables are missing, this function will create them.
  • insert_main_table_row(column_array) - inserts the values in the array into the main table (taking into account any tables that contain only IDs).
  • do_sql(sql_query_string) - takes a string and performs a query, returning the dbi object containing the results. This is the only point where sql is sent to the database.
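
As a rough illustration of how a query params hash could be turned into SQL, here is a minimal sketch. The build_stats_sql function is hypothetical, as are the key names inside the order hash (column, direction) and the where hashes (column, operator, value). It ignores the date range and eprint set restrictions that get_stats takes from the params object, and it uses placeholders for values, which the real code may not do.

```perl
use strict;
use warnings;

# Hypothetical sketch: build an SQL string plus bind values from a query params hash.
sub build_stats_sql {
    my ($table, $query) = @_;
    my @bind;
    my $sql = 'SELECT ' . join(', ', @{ $query->{columns} }) . " FROM $table";

    if (my @where = @{ $query->{where} || [] }) {
        # Each condition is a hash of column, operator and value; all are ANDed.
        $sql .= ' WHERE ' . join(' AND ', map { "$_->{column} $_->{operator} ?" } @where);
        push @bind, map { $_->{value} } @where;
    }
    if ($query->{group_by}) {
        $sql .= " GROUP BY $query->{group_by}";
    }
    if (my $order = $query->{order}) {
        $sql .= " ORDER BY $order->{column} $order->{direction}";
    }
    if (defined $query->{limit}) {
        $sql .= ' LIMIT ' . int($query->{limit});
    }
    return ($sql, @bind);
}

my ($sql, @bind) = build_stats_sql(
    'irstats_true_accesses_table',
    {
        columns  => ['search_terms'],
        where    => [ { column => 'search_engine', operator => '>', value => 0 } ],
        group_by => 'search_terms',
        order    => { column => 'search_terms', direction => 'ASC' },
        limit    => 10,
    },
);
print "$sql\n";
print "bind values: @bind\n";
```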

Date

A date object was implemented because there were some specific things that needed to be done with dates.

Functions

  • new(date_hash) - Creates a new date object when passed a hash with the keys 'day', 'month' and 'year'.
  • validate() - If the date is not valid, it will be modified to a sensible value. E.g. if it's Feb 30th, it will be modified to Feb 29th or 28th, depending on whether it's a leap year.
  • set(part_name, int) - Sets part of the date ('year','month' or 'day') to a specific value.
  • decrement(period) - decrements the date by a period ('day', 'week', 'month', 'quarter', 'year'). Calls the mod_date function, which does the muscle work.
  • increment(period) - increments the date by a period by calling mod_date.
  • part(part_name, style) - Returns the day, month or year. For month, if style=='text', returns a three letter string, otherwise returns an integer. For year, if style=='short', returns the last two digits, otherwise returns all four.
  • difference(date_object) - returns the difference in days between itself and another date.
  • less_than(date_object) - compares itself to another date object. Returns 1 if it's less than it, otherwise returns 0.
  • greater_than(date_object) - compares itself to another date object. Returns 1 if it's greater than it, otherwise returns 0.
  • month_name() - returns the three letter string of the month.
  • render(format_string) - returns a date string. Format can be:
    • 'short' - Calls render_abbreviated - returns a date like this: 05-Jul-77
    • 'numerical' (default) - Calls render_numerical - returns a date like this: 19770705
  • clone - returns a new, identical date object.
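
The clamping behaviour described for validate() might look something like this sketch. It works on a plain hash with 'day', 'month' and 'year' keys rather than on an IRStats::Date object, and the helper names and the clamping of out-of-range months are assumptions.

```perl
use strict;
use warnings;

sub is_leap_year {
    my ($year) = @_;
    return ( ($year % 4 == 0 && $year % 100 != 0) || $year % 400 == 0 ) ? 1 : 0;
}

sub days_in_month {
    my ($year, $month) = @_;
    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    return 29 if $month == 2 && is_leap_year($year);
    return $days[ $month - 1 ];
}

# Clamp out-of-range parts to the nearest valid value (hypothetical helper)
sub validate_date {
    my ($date) = @_;    # hash ref with 'day', 'month' and 'year'
    $date->{month} = 1  if $date->{month} < 1;
    $date->{month} = 12 if $date->{month} > 12;

    my $last_day = days_in_month($date->{year}, $date->{month});
    $date->{day} = 1         if $date->{day} < 1;
    $date->{day} = $last_day if $date->{day} > $last_day;
    return $date;
}

my $date = validate_date({ day => 30, month => 2, year => 2008 });
printf "%04d-%02d-%02d\n", @{$date}{qw(year month day)};    # prints 2008-02-29
```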

Cache

The interface to the cache.

Functions

  • new(id) - takes the ID of the params object we're using at the moment.
  • exists() - returns true if there's a cached file, false if there isn't one.
  • write(visualisation_object) - writes the data to the cache file.
  • read() - returns the data from the cache.

Periods

The Periods object is used when you want to break a date range down into sub-ranges. Used with the params->mask() function, stats can be retrieved for periods inside a date range.

Functions

  • new(start_date_obj, end_date_obj) - doesn't do anything, just returns the object.

The following functions all return an array of hashes. Each hash has the keys 'start_date' and 'end_date', and the values are both IRStats::Date objects.

  • calandar_months - Returns full months (each element starts on the 1st, and ends on the last day).
  • months - Returns month periods (if the start_date is the 15th, then each period starts on the 15th and ends on the 14th of the next month - except the last period, which only has about a 1/30 chance of doing so).
  • weeks - returns 7-day periods (except the last, which has a 1/7 chance of being 7 days long).
  • days - returns single days (for each period, the start_date and end_date are the same).
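
A minimal sketch of how the weeks function could split a range, under some simplifying assumptions: dates are numerical strings (YYYYMMDD) rather than IRStats::Date objects, and the to_epoch and to_ymd helpers are hypothetical.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my $DAY = 24 * 60 * 60;    # seconds in a day

# Convert a numerical date string (YYYYMMDD) to an epoch time at midnight UTC
sub to_epoch {
    my ($ymd) = @_;
    my ($year, $month, $day) = $ymd =~ /^(\d{4})(\d{2})(\d{2})$/
        or die "bad date: $ymd";
    return timegm(0, 0, 0, $day, $month - 1, $year);
}

# Convert an epoch time back to a numerical date string
sub to_ymd {
    my @t = gmtime(shift);
    return sprintf '%04d%02d%02d', $t[5] + 1900, $t[4] + 1, $t[3];
}

# Return a reference to an array of hashes with 'start_date' and 'end_date' keys
sub weeks {
    my ($start, $end) = map { to_epoch($_) } @_;
    my @periods;
    while ($start <= $end) {
        my $period_end = $start + 6 * $DAY;
        $period_end = $end if $period_end > $end;    # the last period may be short
        push @periods, { start_date => to_ymd($start), end_date => to_ymd($period_end) };
        $start = $period_end + $DAY;
    }
    return \@periods;
}

for my $period (@{ weeks('20070101', '20070117') }) {
    print "$period->{start_date} to $period->{end_date}\n";
}
```

Working in UTC keeps every day exactly 86400 seconds long, so daylight saving changes cannot shift a period boundary.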

UserInterface::Controls

This is used to generate the drop boxes in the stats cgi script.

Functions

  • new(params_obj, database_interface_object) - returns the object.
  • start_date_control() - returns the html for the three drop-boxes for selecting the year, month and day of the start date.
  • end_date_control() - returns the html for the three drop-boxes for selecting the year, month and day of the end date.
  • eprint_control() - returns the html for the eprints text box.
  • drop_box(id, contents_array) - returns the html for a drop box containing what is in the array (each array element is a hash containing 'value' and 'display').
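
For illustration, a drop_box function along these lines could build the html from the array of 'value'/'display' hashes. This is a standalone sketch rather than the actual method: the exact markup and the escape_html helper are assumptions.

```perl
use strict;
use warnings;

# Escape the characters that are special in html text and attributes
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Build a select element from an array of { value => ..., display => ... } hashes
sub drop_box {
    my ($id, $contents) = @_;
    my $safe_id = escape_html($id);
    my $html    = qq{<select name="$safe_id" id="$safe_id">\n};
    for my $item (@$contents) {
        $html .= sprintf qq{  <option value="%s">%s</option>\n},
            escape_html($item->{value}), escape_html($item->{display});
    }
    return $html . "</select>\n";
}

print drop_box('start_month', [
    { value => 1, display => 'Jan' },
    { value => 2, display => 'Feb' },
]);
```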

View

A view processes the stats data filtered by the parameters and creates a visualisation. It is intended that savvy users create their own views.

Functions

All views inherit:

  • new(params_obj, database_interface_object) - returns the object.
  • render - calls populate, then returns whatever the visualisation renders

All views must implement:

  • new - passes arguments to superclass, then calls 'initialise'.
  • initialise - the Configuration Constants are set here.
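  • populate - The engine that powers IRStats: it retrieves the stats from the database, processes them and puts the result into the visualisation.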

View::DownloadCountHTML

The DownloadCountHTML is an extremely simple view. It retrieves one row from the database and does no processing, making it ideal for a quick walkthrough:

Housekeeping

At the top of the file, we need:

package IRStats::View::DownloadCountHTML;
use strict;
use warnings;

Next, the modules we will use. I've included perlchartdir, the graph making package, even though we're not using it.

use IRStats::DatabaseInterface;
use IRStats::Cache;
use IRStats::Visualisation::HTML;
use IRStats::View;
use perlchartdir;

And link to superclass.

our @ISA = qw/ IRStats::View /;

Configuration Constants

We aren't actually interested in any columns, just in the count, but we put that in the columns array anyway. We also create our visualisation here.

sub initialise
{
       my ($self) = @_;
       $self->{'sql_params'} = {columns => [ 'COUNT' ]};
       $self->{'visualisation'} = IRStats::Visualisation::HTML->new();
}

new

The new function shouldn't ever need to be any different from this:

sub new
{
       my( $class, $params, $database ) = @_;
       my $self = $class->SUPER::new($params, $database);
       $self->initialise();
       return $self;
}

populate

Almost every populate function should start by checking the cache.

sub populate
{
   my ($self) = @_;
   my $cache = IRStats::Cache->new($self->{'params'}->get('id'));
   if ($cache->exists)
   {
       $self->{'visualisation'} = $cache->read();
       return;
   }

Next, we have to retrieve the stats from the database:

   my $query = $self->{'database'}->get_stats(
           $self->{'params'},
           $self->{'sql_params'}
           );

Now we process them. In this case, we don't even need a loop as we know there's only going to be one row. We'll stick the result straight into some html, and save it. Don't forget that if there isn't any data, you still have to output something.

   my @row = $query->fetchrow_array();
   my $html = '' . ($row[1] ? $row[1] : '0') . "";

A little housekeeping:

   $query->finish();

Pop the data into the visualisation:

   $self->{'visualisation'}->set('html',$html);

Finally, we should write to the cache so we don't have to query the database next time.

   $cache->write($self->{'visualisation'});
}

And that's a really simple view.

Using Periods

If we wanted to break our date range into periods, we'd need to do something like this:

my $periods = IRStats::Periods->new($self->{'params'}->{'start_date'},$self->{'params'}->{'end_date'});
foreach my $period ( @{$periods->calandar_months()} )
{
   $self->{'params'}->mask($period);
   my $query = $self->{'database'}->get_stats(
           $self->{'params'},
           $self->{'sql_params'}
   );
   $self->{'params'}->unmask();
   #process and put into variables
}

Visualisation

Currently Visualisations are Graph, Table or HTML. These are what the user will look at in the browser or download (in the case of CSV).

Functions

All visualisations inherit:

  • new(data_hash) - a hash can optionally be passed containing the values that would otherwise be set using the 'set' function.
  • set(param_name, value) - sets something to something - see subclasses

All visualisations must implement:

  • render() - returns what will be passed to the script.


Visualisation::HTML

The simplest visualisation. Just a chunk of html.

To Populate:

  • set('html', html_string) - takes the html as a string.

Visualisation::Table

The Visualisation::Table currently just passes the buck to its superclass.

There are currently three table Visualisations:

Visualisation::Table::CSV

Returns a CSV table.

To Populate:

  • set('headings', headings_arrayref) - pass an array containing headings.
  • set('rows', rows_arrayref) - an array of arrayrefs, each referencing a row of data.
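
To show how the headings and rows fit together, here is a minimal sketch of rendering them as CSV. The render_csv and csv_field functions are hypothetical and the quoting rules are the usual CSV conventions. The actual Visualisation::Table::CSV may quote fields differently.

```perl
use strict;
use warnings;

# Quote a field only when it contains a comma, a double quote or a line break
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;
        return qq{"$value"};
    }
    return $value;
}

# Render a headings array ref and an array ref of row array refs as CSV text
sub render_csv {
    my ($headings, $rows) = @_;
    my $csv = '';
    for my $line ($headings, @$rows) {
        $csv .= join(',', map { csv_field($_) } @$line) . "\n";
    }
    return $csv;
}

print render_csv(
    [ 'Eprint', 'Downloads' ],
    [
        [ 'A "quoted", title', 3 ],
        [ 'Plain title',       10 ],
    ],
);
```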

Visualisation::Table::HTML

A basic HTML table.

To Populate:

  • set('columns', headings_arrayref) - pass an array containing column headings.
  • set('rows', rows_arrayref) - an array of arrayrefs, each referencing a row of data.

And then optionally

  • set('totals', totals_arrayref) - an array of totals to put at the bottom of the table.

Visualisation::Table::HTML_Columned

An HTML table that is rendered in several columns.

Configuration Constants

  • $default_number_of_rows - an int representing the maximum number of rows the table should have. This is to prevent sending huge tables to browsers which may not be able to handle them.

Overridden Functions

  • new(data_hash, number_of_rows) - Both data_hash and number_of_rows are optional. Both can be set with 'set'.

To Populate:

  • set('columns', headings_arrayref) - pass an array containing column headings.
  • set('rows', rows_arrayref) - an array of arrayrefs, each referencing a row of data.
  • set('number_of_rows', int) - set the maximum number of rows the table should have.

Visualisation::Graph

The graph objects all use Chart Director to generate graphs. The Graph object initialises the colours that the graph may use.

Every graph must be created with at least the filename:

  • new({filename => string}) - the filename comes from the ID of the param object.

Configuration Constants

These are set in the 'new' function.

  • $graph_dir - the path to the directory where the image file will be saved.
  • $url_relative - this will have the filename added to the end and put in the img html tag.

Sub Classes

Note that in the Visualisation/Graph/ directory, there is 'GraphLegend.pm'. This is used to create the html for the graph legends.

Visualisation::Graph::Bar.pm

A Bar Graph. It can have one or more bars in each division of the x axis.

To implement:

  • set('title',string) - The title that will be in the graph image.
  • set('x_title',string) - The title of the x axis.
  • set('y_title',string) - The title of the y axis.
  • set('x_labels',array_ref) - an array containing the labels for the x axis
  • set('data_series', array_ref) - an array of arrayrefs, referencing data for each set of bars

Visualisation::Graph::Line.pm

A Line Graph. It can have many lines.

To implement:

  • set('title',string) - The title that will be in the graph image.
  • set('x_title',string) - The title of the x axis.
  • set('y_title',string) - The title of the y axis.
  • set('x_labels',array_ref) - an array containing the labels for the x axis
  • set('data_series', array_ref) - an array of arrayrefs, referencing data for each line

Visualisation::Graph::Pie.pm

A Pie Graph

To implement:

  • set('title',string) - The title that will be in the graph image.
  • set('data_series', array_ref) - an array of hashrefs, {data => int, citation => string}, one for each slice