IRStats Technical Documentation

From EPrints Documentation
Revision as of 21:40, 29 March 2007 by Gobfrey (talk | contribs) (View::FullTextCountHTML)

This document is intended as guidance for the last stage of development of EPstats.

Directory Structure

/opt/epstats

Contains data files for GeoIP. If I had had root access, I would have put them in the correct place. They are linked to from the correct place. These need regular updating, something which hasn't been implemented.

/opt/epstats/bin

Contains the scripts needed to update the table.

  • daily_update.sh - Runs all the scripts in the right order.
  • extract_metadata_from_archive.pl - Extracts eprint, author and group metadata from the repository by iterating over every eprint.
  • update_table.pl - Filters and processes new entries in the accesslog to update the epstats_true_acesses_table. Uses 'SearchParser.pm' and 'repeatscache'.
  • convert_ip_to_host.pl - Attempts to convert ip addresses of the new entries in epstats_true_acesses_table to hostnames. Uses 'host_updated' to keep track of where it got to last time.

Note that most of these scripts probably need to be tidied up. They were written in a hurry and were never polished.

/opt/epstats/cache

Contains cache files. Feel free to delete these whenever you like.

/opt/epstats/cgi

Contains two scripts, 'get_view' and 'stats'.

  • get_view returns the output of an EPStats::View (see below), which is currently a chunk of HTML or CSV, but could be almost anything.
  • stats is a handy CGI form that passes arguments to get_view.

/opt/epstats/img

Conceptually, where any images would be kept (e.g. national flags). At the moment, only the img/graphs directory is used. This is where generated graphs are stored.

/opt/epstats/perl_lib

Contains all the epstats classes.

EPStats Classes

Note that the leading EPStats:: has been left out for brevity.

Params

This object holds the parameters that are used to generate the statistics. The most important of these are a date range and an eprint set.

Configuration Constants

  • $cgi_script - the name of the cgi script (currently unused)
  • $id_params - When generating an ID, which parameters are important.
  • $defaults - Any default parameters you wish to set.

Functions

  • new(CGI_object) - returns new object
  • mask(params_hash) - used when you want to temporarily overwrite parameter(s). Overwrites values with contents of params_hash. Overwritten values get pushed onto a stack.
  • unmask - Sets parameters back to how they were before the last mask.
  • generate_cgi - returns a string containing the name of the cgi script, and all parameters, to enable the creation of links. (currently unused)
  • get(param_name) - returns the value of a single parameter.
  • create_id - Uses MD5 to create a unique ID from the id_params (see Constants above). This is called whenever get('id') is called.

DatabaseInterface

This object does what it says on the tin. Any access to the database is done through it.

IMPORTANT - the SQL this class generates was developed on a machine running MySQL 5. Installing on the EPrints server has broken this (as it's running MySQL 4). I placed a quick and dirty hack into the do_sql function, and modified the create_top_table function. I have no idea if this works well. IT NEEDS TO BE CHECKED.

Configuration Constants

Constants are contained in the new function.

  • DBI Configuration Constants - $driver, $server, $database, $user, $password are all used to create the connection to the database.
  • source_table - The table in which the stats are stored.

Functions

  • new() - returns object.
  • retreive_set_names() - returns a list of eprint sets. Currently 'group' and 'author' are implemented. This is used to verify cgi input.
  • get_membership(eprint_id, set_name) - For a given eprint ID, returns the members of the named set that it belongs to. For example, we can find out which authors eprint 12614 has with get_membership(12614, 'author').
  • get_citation(id, set, length) - returns a citation. Every set member (eprint, author, group) has two citations: short and full. We only return a short citation if length == 'short'. So, to get the short citation of group 3: get_citation(3,'group','short').
  • get_code($id,$set) - UNWRITTEN - Set members have codes. This is how they are identified by the user. For example, author_lac is the member of the author set whose code is lac. To get the code for group 3: get_code(3,'group').

When retrieving statistics, EPStats filters by inner joining the epstats_true_accesses_table to other tables containing eprint IDs. Sometimes it has to create these tables.

  • create_top_table(param_object) - This creates a table containing the eprint IDs of the top X by fulltext download between two dates.
  • create_list_table(table_name, eprint_ids) - Takes two strings, one the name of the table, the other a space-separated list of eprint IDs. Creates a temporary table.
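
As a rough illustration of what create_list_table has to do, here is a sketch that turns a table name and a space-separated ID string into two SQL statements. The function name list_table_sql and the column name eprint_id are made up for this example, and the real function may build its table quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not the real create_list_table: returns the SQL
# needed to create and fill a temporary table of eprint IDs.
sub list_table_sql {
    my ($table_name, $eprint_ids) = @_;

    # Both strings end up inside SQL, so reject anything unexpected.
    die "bad table name\n" unless $table_name =~ /^\w+$/;
    my @ids = grep { length } split /\s+/, $eprint_ids;
    die "no eprint IDs\n" unless @ids;
    /^\d+$/ or die "bad eprint ID: $_\n" for @ids;

    return (
        # Column name eprint_id is an assumption.
        "CREATE TEMPORARY TABLE $table_name (eprint_id INT NOT NULL)",
        "INSERT INTO $table_name (eprint_id) VALUES "
            . join(', ', map { "($_)" } @ids),
    );
}

my ($create, $insert) = list_table_sql('tmp_list', '12614 12615 9');
print "$create\n$insert\n";
```

Checking that every ID is numeric before building the statement matters here because the list ultimately comes from CGI input.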

The following are the only two functions that actually make calls to the database.

  • do_sql(sql_query_string) - takes a string and performs a query, returning the dbi object containing the results.
  • insert_values(table_name, values) - inserts a row of data into a table.

And finally, the meat and potatoes. The functions that return the statistics we're interested in.

  • get_stats(params_object, column_name_list, options_hash) - returns a dbi object containing the stats we are interested in, i.e. those matching the params_object's date range and eprint sets, and only the columns in column_name_list. The options hash can contain the following key/value pairs:
    • order => column_name - the column on which to order the results. Append '-' or ' DESC' to the column name to order descending.
    • limit => int - How many results to return
    • group_by => column_name - if we need to group by a column.

get_stats works by examining the 'eprints' parameter and calling one of the following functions:

    • get_list_stats
    • get_top_stats
    • get_set_stats
    • get_all_stats

These functions generate slightly different mysql queries, and pass them to the do_sql function.
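
To make the options hash more concrete, here is a minimal sketch of how a query string could be assembled from a column list and the options. build_stats_query is a hypothetical helper, the eprint_id column is an assumption, and the date-range and eprint-set filtering that the real get_*_stats functions add is left out.

```perl
use strict;
use warnings;

# Hypothetical helper, not the real get_*_stats code: builds a SELECT
# from a column list and the where / group_by / order / limit options.
# The option values are trusted here; they come from the view, not the user.
sub build_stats_query {
    my ($table, $columns, $options) = @_;
    my $sql = 'SELECT ' . join(', ', @$columns) . " FROM $table";

    $sql .= " WHERE $options->{where}"       if $options->{where};
    $sql .= " GROUP BY $options->{group_by}" if $options->{group_by};

    if (my $order = $options->{order}) {
        # Assumption: a trailing '-' is shorthand for ' DESC'.
        $order =~ s/-$/ DESC/;
        $sql .= " ORDER BY $order";
    }
    $sql .= ' LIMIT ' . int($options->{limit}) if $options->{limit};
    return $sql;
}

my $sql = build_stats_query(
    'epstats_true_accesses_table',
    [ 'eprint_id', 'COUNT(*) AS downloads' ],
    {
        where    => "fulltxt = 'F'",
        group_by => 'eprint_id',
        order    => 'downloads-',
        limit    => 10,
    },
);
print "$sql\n";
```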

Date

I implemented a date object because there were some specific things I needed to do with dates.

Functions

  • new(date_hash) - Creates a new date object when passed a hash with the keys 'day', 'month' and 'year'.
  • validate() - If the date is not valid, it will be modified to a sensible value. E.g. if it's Feb 30th, it will be modified to Feb 29th or 28th, depending on whether it's a leap year.
  • set(part_name, int) - Sets part of the date ('year','month' or 'day') to a specific value.
  • decrement(period) - decrements the date by a period ('day', 'week', 'month', 'quarter', 'year'). Calls the mod_date function, which does the muscle work.
  • increment(period) - increments the date by a period, also by calling mod_date.
  • part(part_name, style) - Returns the day, month or year. For month, if style=='text', returns a three letter string, otherwise returns an integer. For year, if style=='short', returns the last two digits, otherwise returns all four.
  • less_than(date_object) - compares itself to another date object. Returns 1 if it's less than it, otherwise returns 0.
  • greater_than(date_object) - compares itself to another date object. Returns 1 if it's greater than it, otherwise returns 0.
  • month_name() - returns the three letter string of the month.
  • render(format_string) - returns a date string. Format can be:
    • 'short' - Calls render_abbreviated - returns a date like this: 05-Jul-77
    • 'long' - Calls render_full (not implemented).
    • 'numerical' (default) - Calls render_numerical - returns a date like this: 19770705
  • clone - returns a new, identical date object.
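
The validate behaviour described above can be sketched as follows. This works on a plain hash with 'day', 'month' and 'year' keys rather than on a real EPStats::Date object, and the helper names is_leap_year and days_in_month are invented for the example. How the real class treats out-of-range months is not documented, so the clamping of months here is an assumption.

```perl
use strict;
use warnings;

# Gregorian leap-year rule.
sub is_leap_year {
    my ($year) = @_;
    my $leap = ($year % 4 == 0 && $year % 100 != 0) || $year % 400 == 0;
    return $leap ? 1 : 0;
}

sub days_in_month {
    my ($year, $month) = @_;
    my @days = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    return 29 if $month == 2 && is_leap_year($year);
    return $days[ $month - 1 ];
}

# Clamp an out-of-range date hash to the nearest sensible value.
sub validate {
    my ($date) = @_;
    $date->{month} = 1  if $date->{month} < 1;
    $date->{month} = 12 if $date->{month} > 12;
    my $last = days_in_month($date->{year}, $date->{month});
    $date->{day} = 1     if $date->{day} < 1;
    $date->{day} = $last if $date->{day} > $last;
    return $date;
}

# Feb 30th in a non-leap year becomes Feb 28th.
my $date = validate({ year => 2007, month => 2, day => 30 });

# Same layout as the 'numerical' render style (YYYYMMDD).
printf "%04d%02d%02d\n", @{$date}{qw(year month day)};
```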


Cache

The interface to the cache.

Configuration Constants

  • $cache_directory - a string containing a path to the directory in which the cache files are located.

Functions

  • new(id) - takes the ID of the params object we're using at the moment.
  • exists() - returns true if there's a cached file, false if there isn't one.
  • write(visualisation_object) - writes the data to the cache file.
  • read() - returns the data from the cache.
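
Here is a minimal sketch of a file cache with the same four functions. It assumes the data is serialised with Storable into one file per ID, which may not be how the real Cache class stores things, and it uses a temporary directory in place of the configured $cache_directory.

```perl
use strict;
use warnings;
use Storable ();
use File::Spec;
use File::Temp ();

# Hypothetical stand-in for EPStats::Cache; not the real class.
package Sketch::Cache;

# Stand-in for $cache_directory: a throwaway directory for the demo.
our $cache_directory = File::Temp::tempdir(CLEANUP => 1);

# One cache file per params ID.
sub new {
    my ($class, $id) = @_;
    my $file = File::Spec->catfile($cache_directory, $id);
    return bless { file => $file }, $class;
}

sub exists {
    my ($self) = @_;
    my $found = -e $self->{file};
    return $found ? 1 : 0;
}

# Assumption: the visualisation is a reference that Storable can serialise.
sub write {
    my ($self, $visualisation) = @_;
    Storable::store($visualisation, $self->{file});
    return;
}

sub read {
    my ($self) = @_;
    return Storable::retrieve($self->{file});
}

package main;

my $cache      = Sketch::Cache->new('0123456789abcdef');
my $was_cached = $cache->exists;          # nothing written yet
$cache->write({ html => '42' });
my $restored   = $cache->read();
print "cached before: $was_cached, html now: $restored->{html}\n";
```

Because the file name is the params ID, two requests with the same ID-relevant parameters share one cache file, which is why deleting the cache directory contents is always safe.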

Periods

The Periods object is used when you want to break a date range down into sub-ranges. Used with the params->mask() function, stats can be retrieved for periods inside a date range.

Functions

  • new(start_date_obj, end_date_obj) - doesn't do anything, just returns the object.

The following functions all return an array of hashes. Each hash has the keys 'start_date' and 'end_date', and the values are both EPStats::Date objects.

  • calandar_months - Returns full months (each element starts on the 1st, and ends on the last day).
  • months - Returns month periods (if the start_date is the 15th, then each period starts on the 15th and ends on the 14th of the next month - except the last period, which only has about a 1/30 chance of doing so).
  • weeks - returns 7-day periods (except the last, which has a 1/7 chance of being 7 days long).
  • days - returns single days (for each period, the start_date and end_date are the same).
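
The weeks splitting described above could look roughly like this. The sketch uses 'YYYYMMDD' strings and the core Time::Local module instead of EPStats::Date objects, so it only illustrates the shape of the returned list, not the real implementation.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my $DAY = 24 * 60 * 60;

# Convert 'YYYYMMDD' to epoch seconds (UTC, so no daylight-saving gaps).
sub to_epoch {
    my ($ymd) = @_;
    my ($y, $m, $d) = $ymd =~ /^(\d{4})(\d{2})(\d{2})$/
        or die "bad date: $ymd\n";
    return timegm(0, 0, 0, $d, $m - 1, $y);
}

sub to_ymd {
    my ($epoch) = @_;
    my (undef, undef, undef, $d, $m, $y) = gmtime($epoch);
    return sprintf '%04d%02d%02d', $y + 1900, $m + 1, $d;
}

# Split [start, end] into 7-day periods; the last one may be shorter.
sub weeks {
    my ($start, $end) = @_;
    my ($from, $last) = (to_epoch($start), to_epoch($end));
    my @periods;
    while ($from <= $last) {
        my $to = $from + 6 * $DAY;
        $to = $last if $to > $last;
        push @periods, { start_date => to_ymd($from), end_date => to_ymd($to) };
        $from = $to + $DAY;
    }
    return @periods;
}

my @periods = weeks('20070301', '20070317');
print "$_->{start_date} to $_->{end_date}\n" for @periods;
```

Each hash could then be handed to params->mask() in turn, with unmask called after the stats for that period have been fetched.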

UserInterface::Controls

This is used to generate the drop boxes in the stats cgi script. If I had more time I'd document it fully, but my daughter's going to be born in less than 12 hours.

Page (deprecated)

Harkens back to the day when a page object contained views.

View

A view processes the stats data filtered by the parameters and creates a visualisation.

Functions

All views inherit:

  • new(params_obj, database_interface_object) - returns the object.
  • render - calls populate, then returns whatever the visualisation renders

All views must implement:

  • new - passes arguments to superclass, then calls 'initialise'.
  • initialise - the Configuration Constants are set here.
  • populate - The engine that powers EPStats.
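
The split between inherited and implemented functions can be sketched like this. All the Sketch:: package names are hypothetical and the visualisation is a bare stub, so this only shows the calling order (new, initialise, render, populate), not the real EPStats::View code.

```perl
use strict;
use warnings;

# Bare stub of a visualisation: stores values and renders the html one.
package Sketch::StubVisualisation;
sub new    { my ($class) = @_; return bless {}, $class }
sub set    { my ($self, $name, $value) = @_; $self->{$name} = $value }
sub render { my ($self) = @_; return $self->{html} }

# Hypothetical base class mirroring what all views are said to inherit.
package Sketch::View;

sub new {
    my ($class, $params, $database) = @_;
    return bless { params => $params, database => $database }, $class;
}

# Fill the visualisation first, then return whatever it renders.
sub render {
    my ($self) = @_;
    $self->populate();
    return $self->{visualisation}->render();
}

# Each concrete view has to supply its own populate.
sub populate { my ($self) = @_; die ref($self) . " must implement populate\n" }

# A concrete view: new calls initialise, populate fills the visualisation.
package Sketch::View::Hello;
our @ISA = ('Sketch::View');

sub new {
    my ($class, $params, $database) = @_;
    my $self = $class->SUPER::new($params, $database);
    $self->initialise();
    return $self;
}

sub initialise {
    my ($self) = @_;
    $self->{visualisation} = Sketch::StubVisualisation->new();
}

sub populate {
    my ($self) = @_;
    $self->{visualisation}->set('html', 'hello');
}

package main;

my $output = Sketch::View::Hello->new({}, undef)->render();
print "$output\n";
```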

View::FullTextCountHTML

The FullTextCountHTML is an extremely simple view. It retrieves one row from the database and does no processing.

At the top of the file, we need:

package EPStats::View::FullTextCountHTML;
use strict;
use warnings;

Now, which modules will we use? I've included perlchartdir, the graph-making package, even though we're not using it.

use EPStats::DatabaseInterface;
use EPStats::Cache;
use EPStats::Visualisation::HTML;
use EPStats::View;
use perlchartdir;

And link to superclass.

our @ISA = qw/ EPStats::View /;

Configuration Constants

We are interested in retrieving the fulltxt column, and a count, as we will be aggregating. The sql_params are set so that we can filter on fulltext downloads, and we need to group as we are counting. We also create our visualisation here.

sub initialise
{
       my ($self) = @_;
       $self->{'sql_columns'} = [ 'fulltxt', 'COUNT(fulltxt)' ];
       $self->{'sql_params'} = {where => "fulltxt = 'F'", group_by => 'fulltxt'};
       $self->{'visualisation'} = EPStats::Visualisation::HTML->new();
}

new

The new function shouldn't ever need to be any different from this:

sub new
{
       my( $class, $params, $database ) = @_;
       my $self = $class->SUPER::new($params, $database);
       $self->initialise();
       return $self;
}

populate

Every populate function should start by checking the cache.

sub populate
{
   my ($self) = @_;
   my $cache = EPStats::Cache->new($self->{'params'}->get('id'));
   if ($cache->exists)
   {
       $self->{'visualisation'} = $cache->read();
       return;
   }

Next, we have to retrieve the data from the database:

   my $query = $self->{'database'}->get_stats(
           $self->{'params'},
           $self->{'sql_columns'},
           $self->{'sql_params'}
           );

Now we process them. In this case, we don't even need a loop as we know there's only going to be one row. We'll stick the result straight into some html, and save it. Don't forget that if there isn't any data, you still have to output something.

   my @row = $query->fetchrow_array();
   my $html = '' . ($row[1] ? $row[1] : '0') . "";

A little housekeeping:

   $query->finish();

Pop the data into the visualisation:

   $self->{'visualisation'}->set('html',$html);

Finally, we should write to the cache so we don't have to query the database next time.

   $cache->write($self->{'visualisation'});
}

And that's a really simple view.

Visualisation

Currently Visualisations are Graph, Table or HTML. These are what the user will look at in the browser or download (in the case of CSV).

Functions

All visualisations inherit:

  • new(data_hash) - a hash can optionally be passed containing the values that would otherwise be set using the 'set' function.
  • set(param_name, value) - sets something to something - see subclasses

All visualisations must implement:

  • render() - returns what will be passed to the script.


Visualisation::HTML

The simplest visualisation. Just a chunk of html.

To Populate:

  • set('html', html_string) - takes the html as a string.
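
A minimal sketch of the base class and the HTML subclass might look like the following. The Sketch:: package names are hypothetical, and returning $self from set is a convenience added here, not something the original is known to do.

```perl
use strict;
use warnings;

# Hypothetical base class: stores whatever is passed to new() or set().
package Sketch::Visualisation;

sub new {
    my ($class, $data) = @_;
    my $self = bless {}, $class;
    # The optional data hash is just a shortcut for calling set() repeatedly.
    $self->set($_, $data->{$_}) for keys %{ $data || {} };
    return $self;
}

sub set {
    my ($self, $name, $value) = @_;
    $self->{$name} = $value;
    return $self;
}

# Subclasses must override this.
sub render { my ($self) = @_; die ref($self) . " must implement render\n" }

# The simplest subclass: render() just returns the stored html string.
package Sketch::Visualisation::HTML;
our @ISA = ('Sketch::Visualisation');

sub render {
    my ($self) = @_;
    return defined $self->{html} ? $self->{html} : '';
}

package main;

# Two equivalent ways to populate the visualisation.
my $via_new = Sketch::Visualisation::HTML->new({ html => '<b>42</b>' });
my $via_set = Sketch::Visualisation::HTML->new();
$via_set->set('html', '<b>42</b>');

print $via_new->render(), "\n";
```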

Visualisation::Table

The Visualisation::Table currently just passes the buck to its superclass.

There are currently three table Visualisations:

Visualisation::Table::CSV

Returns a CSV table.

To Populate:

  • set('headings', headings_arrayref) - pass an array containing headings.
  • set('rows', rows_arrayref) - an array of arrayrefs, each referencing a row of data.
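
For illustration, here is one way headings and rows could be turned into CSV text. The function names are invented, and the quoting rule (double quotes around fields containing commas, quotes or line breaks) is the common convention rather than something confirmed from the real Visualisation::Table::CSV.

```perl
use strict;
use warnings;

# Quote a field only when it contains a comma, a quote or a line break.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;          # embedded quotes are doubled
        $value = qq("$value");
    }
    return $value;
}

# One line for the headings, then one line per row.
sub render_csv {
    my ($headings, $rows) = @_;
    my @lines;
    for my $row ($headings, @$rows) {
        push @lines, join(',', map { csv_field($_) } @$row);
    }
    return join('', map { "$_\n" } @lines);
}

my $csv = render_csv(
    [ 'Eprint', 'Title', 'Downloads' ],
    [
        [ 12614, 'Stats, "done" right', 42 ],
        [ 12615, 'Plain title',         7 ],
    ],
);
print $csv;
```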

Visualisation::Table::HTML

A basic HTML table.

To Populate:

  • set('columns', headings_arrayref) - pass an array containing column headings.
  • set('rows', rows_arrayref) - an array of arrayrefs, each referencing a row of data.

And then optionally

  • set('totals', totals_arrayref) - an array of totals to put at the bottom of the table.

Visualisation::Table::HTML_Columned

An HTML table that is rendered in several columns.

Configuration Constants

  • $default_number_of_rows - an int representing the maximum number of rows the table should have.

Overridden Functions

  • new(data_hash, number_of_rows) - Both data_hash and number_of_rows are optional. Both can be set with 'set'.

To Populate:

  • set('columns', headings_arrayref) - pass an array containing column headings.
  • set('rows', rows_arrayref) - an array of arrayrefs, each referencing a row of data.
  • set('number_of_rows', int) - set the maximum number of rows the table should have.
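
One plausible reading of number_of_rows is that the data rows are cut into chunks of at most that many rows, and each chunk becomes one column of the rendered table. The sketch below shows only that splitting step; split_into_columns is a hypothetical name and the real class may lay the table out differently.

```perl
use strict;
use warnings;

# Cut a list of rows into chunks of at most $max_rows rows each.
sub split_into_columns {
    my ($rows, $max_rows) = @_;
    die "max_rows must be positive\n" unless $max_rows && $max_rows > 0;
    my @pending = @$rows;    # copy, so the caller's array is left alone
    my @columns;
    push @columns, [ splice @pending, 0, $max_rows ] while @pending;
    return @columns;
}

# Seven rows with at most three per column gives chunks of 3, 3 and 1.
my @rows    = map { [ "eprint $_", $_ * 10 ] } 1 .. 7;
my @columns = split_into_columns(\@rows, 3);
printf "%d columns, sizes: %s\n",
    scalar @columns, join(', ', map { scalar @$_ } @columns);
```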

Visualisation::Graph

The graph objects all use Chart Director to generate graphs. The Graph object initialises the colours that the graph may use.

Every graph must be created with at least the filename:

  • new({filename => string}) - the filename comes from the ID of the param object.

Configuration Constants

These are set in the 'new' function.

  • $graph_dir - the path to the directory where the image file will be saved.
  • $url_relative - this will have the filename added to the end and put in the img html tag.

Sub Classes

Note that in the Visualisation/Graph/ directory, there is 'GraphLegend.pm'. This is used to create the html for the graph legends.

Visualisation::Graph::Bar.pm

A Bar Graph. It can have one or more bars in each division of the x axis.

To implement:

  • set('title',string) - The title that will be in the graph image.
  • set('x_title',string) - The title of the x axis.
  • set('y_title',string) - The title of the y axis.
  • set('x_labels',array_ref) - an array containing the labels for the x axis
  • set('data_series', array_ref) - an array of arrayrefs, referencing data for each set of bars

Visualisation::Graph::Line.pm

A Line Graph. There can be many lines on it.

To implement:

  • set('title',string) - The title that will be in the graph image.
  • set('x_title',string) - The title of the x axis.
  • set('y_title',string) - The title of the y axis.
  • set('x_labels',array_ref) - an array containing the labels for the x axis
  • set('data_series', array_ref) - an array of arrayrefs, referencing data for each line

Visualisation::Graph::Pie.pm

A Pie Graph

To implement:

  • set('title',string) - The title that will be in the graph image.
  • set('data_series', array_ref) - an array of hashrefs, {data => int, citation => string}, one for each slice