IRStats Technical Documentation

From EPrints Documentation
Revision as of 19:52, 29 March 2007 by Gobfrey

This document is intended as guidance for the last stage of development of EPstats.

Directory Structure

/opt/epstats

Contains the data files for GeoIP. I did not have root access, so I could not put them in the correct place; they live here and are linked to from the correct place. They need regular updating, which has not been implemented.

/opt/epstats/bin

Contains the scripts needed to update the table.

  • daily_update.sh - Runs all the scripts in the right order.
  • extract_metadata_from_archive.pl - Extracts eprint, author and group metadata from the repository by iterating over every eprint.
  • update_table.pl - Filters and processes new entries in the accesslog to update the epstats_true_acesses_table. Uses 'SearchParser.pm' and 'repeatscache'.
  • convert_ip_to_host.pl - Attempts to convert the IP addresses of the new entries in epstats_true_acesses_table to hostnames. Uses 'host_updated' to keep track of where it got to last time.

Note that most of these scripts probably need to be tidied up. They were written in a hurry and were never polished.

/opt/epstats/cache

Contains cache files. Feel free to delete these whenever you like.

/opt/epstats/cgi

Contains two scripts, 'get_view' and 'stats'.

  • get_view returns the output of an EPStats::View (see below), which is currently a chunk of HTML or CSV, but could be almost anything.
  • stats is a handy CGI form that passes arguments to get_view.

/opt/epstats/img

Conceptually, where any images would be kept (e.g. national flags). At the moment, only the img/graphs directory is used. This is where generated graphs are stored.

/opt/epstats/perl_lib

Contains all the epstats classes.

EPStats Classes

Note that the leading EPStats:: has been left out for brevity.

Params

This object holds the parameters that are used to generate the statistics. The most important of these are a date range and an eprint set.

Configuration Constants

  • $cgi_script - the name of the cgi script (currently unused)
  • $id_params - When generating an ID, which parameters are important.
  • $defaults - Any default parameters you wish to set.

Functions

  • new(CGI_object) - returns new object
  • mask(params_hash) - used when you want to temporarily overwrite parameter(s). Overwrites values with contents of params_hash. Overwritten values get pushed onto a stack.
  • unmask - Sets parameters back to how they were before the last mask.
  • generate_cgi - returns a string containing the name of the cgi script, and all parameters, to enable the creation of links. (currently unused)
  • get(param_name) - returns the value of a single parameter.
  • create_id - Uses MD5 to create a unique ID from the id_params (see Constants above). This is called whenever get('id') is called.
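The mask/unmask stack and the MD5-based ID described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl of the actual module, and it makes several assumptions: the parameter names (start_date, end_date, eprints), the value formats, and the way the ID parameters are joined before hashing are all hypothetical and are not taken from the EPStats source.

```python
import hashlib

_MISSING = object()  # marks a parameter that did not exist before mask()


class Params:
    """Illustrative stand-in for the Params object, not the real class."""

    # Hypothetical equivalent of $id_params: the parameters that feed the ID.
    ID_PARAMS = ("start_date", "end_date", "eprints")

    def __init__(self, values):
        self._values = dict(values)
        self._stack = []  # one entry of saved values per mask() call

    def mask(self, overrides):
        """Temporarily overwrite parameters, pushing the old values on a stack."""
        saved = {name: self._values.get(name, _MISSING) for name in overrides}
        self._stack.append(saved)
        self._values.update(overrides)

    def unmask(self):
        """Restore the parameters to their state before the last mask()."""
        saved = self._stack.pop()
        for name, old in saved.items():
            if old is _MISSING:
                del self._values[name]
            else:
                self._values[name] = old

    def create_id(self):
        """Hash the ID parameters with MD5 (the joining scheme is assumed)."""
        joined = "|".join(
            f"{name}={self._values.get(name, '')}" for name in self.ID_PARAMS
        )
        return hashlib.md5(joined.encode("utf-8")).hexdigest()

    def get(self, name):
        """Return one parameter; 'id' is computed on demand via create_id()."""
        if name == "id":
            return self.create_id()
        return self._values.get(name)


# Usage: the ID changes while a mask is active and returns after unmask().
params = Params({"start_date": "20070101", "end_date": "20070131", "eprints": "all"})
original_id = params.get("id")
params.mask({"eprints": "author_lac"})
masked_id = params.get("id")
params.unmask()
restored_id = params.get("id")
```

Because the ID is derived only from the ID parameters, two requests that differ in other parameters would share an ID in this sketch; whether that matches the real behaviour depends on how $id_params is configured.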

DatabaseInterface

This object does what it says on the tin. Any access to the database is done through it.

IMPORTANT - the generated SQL was developed on a machine running MySQL 5. Installing on the EPrints server, which runs MySQL 4, broke it. I placed a quick and dirty hack into the do_sql function, and modified the create_top_table function. I have no idea if this works well. IT NEEDS TO BE CHECKED.

Configuration Constants

Constants are contained in the new function.

  • DBI Configuration Constants - $driver, $server, $database, $user, $password are all used to create the connection to the database.
  • source_table - The table in which the stats are stored.

Functions

  • new() - returns object.
  • retreive_set_names() - returns a list of eprint sets. Currently 'group' and 'author' are implemented. This is used to verify CGI input.
  • get_membership(eprint_id, set_name) - returns the members of the named set to which a given eprint belongs. For example, we can find out which authors eprint 12614 has with get_membership(12614, 'author').
  • get_citation(id, set, length) - returns a citation. Every set member (eprint, author, group) has two citations: short and full. The short citation is returned only if length == 'short'. So, to get the short citation of group 3: get_citation(3,'group','short').
  • get_code($id,$set) - UNWRITTEN - Set members have codes, which is how the user identifies them. For example, author_lac is the member of the author set whose code is lac. To get the code for group 3: get_code(3,'group').

When retrieving statistics, EPStats filters by inner joining the epstats_true_accesses_table to other tables containing eprint IDs. Sometimes it has to create these tables.

  • create_top_table(param_object) - This creates a table containing the eprint IDs of the top X by fulltext download between two dates.
  • create_list_table(table_name, eprint_ids) - Takes two strings: the name of the table, and a space-separated list of eprint IDs. Creates a temporary table.
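The filter-by-inner-join idea behind create_list_table can be sketched like this. The example is in Python and uses an in-memory SQLite database rather than MySQL, purely so that it is self-contained; the accesses table, its columns and the temporary table name are hypothetical stand-ins, not the real EPStats schema.

```python
import sqlite3


def create_list_table(conn, table_name, eprint_ids):
    """Create a temporary table from a space-separated string of eprint IDs."""
    # Table names cannot be bound as SQL parameters, so validate the name.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    ids = [int(token) for token in eprint_ids.split()]
    conn.execute(
        f"CREATE TEMPORARY TABLE {table_name} (eprint_id INTEGER PRIMARY KEY)"
    )
    # OR IGNORE skips IDs that appear more than once in the list.
    conn.executemany(
        f"INSERT OR IGNORE INTO {table_name} (eprint_id) VALUES (?)",
        [(eprint_id,) for eprint_id in ids],
    )


conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the accesses table.
conn.execute("CREATE TABLE accesses (eprint_id INTEGER, access_date TEXT)")
conn.executemany(
    "INSERT INTO accesses VALUES (?, ?)",
    [
        (12614, "2007-03-01"),
        (12614, "2007-03-02"),
        (42, "2007-03-02"),
        (7, "2007-03-03"),
    ],
)

create_list_table(conn, "wanted", "12614 7")

# The inner join keeps only accesses whose eprint ID is in the list table.
rows = conn.execute(
    "SELECT a.eprint_id, COUNT(*) FROM accesses AS a "
    "INNER JOIN wanted AS w ON w.eprint_id = a.eprint_id "
    "GROUP BY a.eprint_id ORDER BY a.eprint_id"
).fetchall()
```

Here the access for eprint 42 is dropped because 42 is not in the list table, which is the filtering effect the join is used for.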

The following are the only two functions that actually make calls to the database.

  • do_sql(sql_query_string) - takes a string and performs a query, returning the dbi object containing the results.
  • insert_values(table_name, values) - inserts a row of data into a table.

And finally, the meat and potatoes. The functions that return the statistics we're interested in.

  • get_stats(params_object, column_name_list, options_hash) - returns a DBI object containing the stats we are interested in, i.e. those within the params_object's date range and eprint sets, restricted to the columns in column_name_list. The options hash can contain the following key/value pairs:
    • order => column_name - the column on which to order the results. Append '-' or ' DESC' to order descending.
    • limit => int - How many results to return
    • group_by => column_name - if we need to group by a column.
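One way the options described above could be turned into SQL clauses is sketched below. This is a Python illustration, not the module's actual Perl, and it rests on assumptions: that '-' is a suffix on the column name (the text says "append"), that the clauses are emitted in the usual GROUP BY, ORDER BY, LIMIT order, and that column names such as downloads and eprint_id exist (they are hypothetical).

```python
def _column(name):
    """Accept only plain identifiers so the name is safe to place in SQL."""
    name = name.strip()
    if not name.isidentifier():
        raise ValueError(f"unsafe column name: {name!r}")
    return name


def build_clauses(options):
    """Translate an options mapping into GROUP BY / ORDER BY / LIMIT clauses."""
    clauses = []

    group_by = options.get("group_by")
    if group_by:
        clauses.append(f"GROUP BY {_column(group_by)}")

    order = options.get("order")
    if order:
        descending = False
        # A trailing ' DESC' or '-' requests descending order.
        if order.endswith(" DESC"):
            order, descending = order[: -len(" DESC")], True
        elif order.endswith("-"):
            order, descending = order[:-1], True
        clause = f"ORDER BY {_column(order)}"
        if descending:
            clause += " DESC"
        clauses.append(clause)

    limit = options.get("limit")
    if limit is not None:
        clauses.append(f"LIMIT {int(limit)}")

    return " ".join(clauses)


# Usage: top ten eprints by a hypothetical 'downloads' column.
top_ten = build_clauses(
    {"order": "downloads-", "limit": 10, "group_by": "eprint_id"}
)
```

Validating column names and coercing the limit to an integer matters here because these values cannot be passed as bound parameters and may originate from CGI input.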

get_stats works by examining the 'eprints' parameter and calling one of the following functions:

    • get_list_stats
    • get_top_stats
    • get_set_stats
    • get_all_stats

These functions generate slightly different MySQL queries, and pass them to the do_sql function.

Date

I implemented a date object because there were some specific things I needed to do with dates.

Functions

  • new(date_hash) - Creates a new date object when passed a hash with the keys 'day', 'month' and 'year'.
  • validate() - If the date is not valid, it will be modified to a sensible value, e.g. Feb 30th will be modified to Feb 29th or 28th, depending on whether it's a leap year.
  • set(part_name, int) - Sets part of the date ('year','month' or 'day') to a specific value.
  • decrement(period) - decrements the date by a period ('day', 'week', 'month', 'quarter', 'year'). Calls the mod_date function, which does the muscle work.
  • increment(period) - increments the date by a period, also by calling mod_date.
  • part(part_name, style) - Returns the day, month or year. For month, if style=='text', returns a three letter string, otherwise returns an integer. For year, if style=='short', returns the last two digits, otherwise returns all four.
  • less_than(date_object) - compares itself to another date object. Returns 1 if it's less than it, otherwise returns 0.
  • greater_than(date_object) - compares itself to another date object. Returns 1 if it's greater than it, otherwise returns 0.
  • month_name() - returns the three letter string of the month.
  • render(format_string) - returns a date string. Format can be:
    • 'short' - Calls render_abbreviated - returns a date like this: 05-Jul-77
    • 'long' - Calls render_full (not implemented).
    • 'numerical' (default) - Calls render_numerical - returns a date like this: 19770705
  • clone - returns a new, identical date object.
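The validation, comparison and rendering behaviour listed above can be sketched as follows. This is an illustrative Python version, not the Perl Date class itself. It assumes that an out-of-range month is also clamped (the text only gives the Feb 30th example) and that validate() is called explicitly rather than from the constructor. increment and decrement are left out because the behaviour of mod_date is not described.

```python
import calendar

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


class Date:
    """Illustrative stand-in for the Date object, not the real class."""

    def __init__(self, parts):
        self.day = int(parts["day"])
        self.month = int(parts["month"])
        self.year = int(parts["year"])

    def validate(self):
        """Clamp an invalid date to the nearest sensible value."""
        self.month = min(max(self.month, 1), 12)  # assumed behaviour
        last_day = calendar.monthrange(self.year, self.month)[1]
        self.day = min(max(self.day, 1), last_day)

    def month_name(self):
        return MONTH_NAMES[self.month - 1]

    def part(self, part_name, style=None):
        if part_name == "month" and style == "text":
            return self.month_name()
        if part_name == "year" and style == "short":
            return self.year % 100
        return getattr(self, part_name)

    def _key(self):
        return (self.year, self.month, self.day)

    def less_than(self, other):
        return 1 if self._key() < other._key() else 0

    def greater_than(self, other):
        return 1 if self._key() > other._key() else 0

    def render(self, format_string="numerical"):
        if format_string == "short":
            return f"{self.day:02d}-{self.month_name()}-{self.year % 100:02d}"
        if format_string == "numerical":
            return f"{self.year:04d}{self.month:02d}{self.day:02d}"
        # 'long' is listed as not implemented in the text above.
        raise NotImplementedError(format_string)

    def clone(self):
        return Date({"day": self.day, "month": self.month, "year": self.year})


# Usage: render the example date, then clamp Feb 30th in a non-leap year.
birthday = Date({"day": 5, "month": 7, "year": 1977})
short_form = birthday.render("short")
numerical_form = birthday.render()

february = Date({"day": 30, "month": 2, "year": 2007})
february.validate()
```

The numerical format sorts correctly as a plain string, which is presumably why it is a convenient default.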


Cache


Periods

UserInterface::Controls

Page (deprecated)


View

View.pm

Visualisation

Visualisation.pm