IRStats 2 Technical Documentation

From EPrints Documentation
Revision as of 15:47, 4 September 2015 by Martin.braendle@id.uzh.ch (talk | contribs) (Example how to create a processor module added (not yet finished))

Configuration

This section details how to configure IRStats2 and mostly relates to the file cfg/cfg.d/z_irstats2.pl.

It is good practice to put your changes in a separate file (e.g. zz_irstats2_local.pl) that sorts alphabetically after z_irstats2.pl. Configuration files load in alphabetical order and later files override earlier ones, so this makes Bazaar updates easier to apply.

Datasets/Datatypes

Since IRStats2 can handle any EPrints datasets (not just the 'access' dataset which records downloads), you can declare in the configuration which EPrints datasets to process. For each EPrints dataset configured, IRStats2 will pass on the records from the Database to each processing module. This is coupled to the Stats::Processor modules and you will see that, by default, IRStats2 processes:

  • The "access" dataset with the associated Stats::Processor::Access modules
  • The "eprint" dataset with the associated Stats::Processor::EPrint modules
  • The "history" dataset with, as you have guessed, the Stats::Processor::History modules

Each module provides specific data-types, which are declared in the module itself. For instance, Stats::Processor::Access::Downloads provides the "downloads" and "views" data-types.

Configuration example and options

access => { 
	filters => [ 'Robots', 'Repeat' ], 
	incremental => 1 
}

The only two options which can be used are:

incremental
1 or 0 (default 1) - tells IRStats2 to process the DB records incrementally. Since IRStats2 data must be processed daily, this indicates whether only the new records are processed (1) or the entire dataset is reprocessed every day (0). For downloads (i.e. the "access" dataset), you only need to process the daily downloads; there is no need to restart from 0. However, some metrics used for the "eprint" dataset require the entire dataset to be reprocessed daily, which is OK as the "eprint" dataset is usually much smaller than the "access" one.
filters
an array of Filters (default []) - tells IRStats2 to apply filters before processing the records. This is especially useful for "access" records where hits by robots/crawlers are usually removed. Filters are very similar to Processor modules, except that they must return a boolean to indicate whether to keep or to discard the record. If the record is kept then it is passed on to the related Processor modules.
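
The keep-or-discard behaviour of filters can be pictured with a small stand-alone sketch. This is a conceptual model only: filters and processors are plain code references here rather than IRStats2 plug-in modules, the records and the toy robot test are made up, and it assumes that a true return value means "keep the record".

```perl
use strict;
use warnings;

# Conceptual model only, not IRStats2 code.
# Assumption: a filter returning true means "keep the record".
sub run_pipeline
{
	my( $records, $filters, $processors ) = @_;

	my $kept = 0;
	RECORD: foreach my $record ( @$records )
	{
		foreach my $filter ( @$filters )
		{
			# the first filter that rejects a record stops all further work on it
			next RECORD unless $filter->( $record );
		}
		# the record survived every filter: hand it to each processor
		$_->( $record ) for @$processors;
		$kept++;
	}
	return $kept;
}

# hypothetical "access" records
my @records = (
	{ eprintid => 1, agent => 'Mozilla/5.0' },
	{ eprintid => 1, agent => 'ExampleBot/1.0' },
	{ eprintid => 2, agent => 'Mozilla/5.0' },
);

# toy filter: discard anything whose user agent looks like a robot
my $not_a_robot = sub { $_[0]->{agent} !~ /bot/i };

# toy processor: count downloads per eprint
my %downloads;
my $count_downloads = sub { $downloads{ $_[0]->{eprintid} }++ };

my $kept = run_pipeline( \@records, [ $not_a_robot ], [ $count_downloads ] );
printf "%d of %d records kept\n", $kept, scalar @records;
```

The processors never see a discarded record, which is why robot hits do not end up in the download counts.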

Remember that if you want to process new datasets (e.g. "user") then you must write the associated Stats::Processor modules, otherwise nothing would happen.
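
As a rough sketch, a local override file could declare the two options per dataset as follows. The file name is only an example, and the location of the declarations ($c->{irstats2}->{datasets}) is an assumption to be checked against your copy of z_irstats2.pl.

```perl
# cfg/cfg.d/zz_irstats2_local.pl (example name for a local override file)
# $c is the repository configuration hash supplied by EPrints when the file is loaded.
# Assumption: dataset declarations live under $c->{irstats2}->{datasets}.

# downloads: drop robot and repeat hits, and only process the new records each day
$c->{irstats2}->{datasets}->{access} = {
	filters => [ 'Robots', 'Repeat' ],
	incremental => 1,
};

# eprints: no filters, and reprocess the whole dataset on every run
$c->{irstats2}->{datasets}->{eprint} = {
	filters => [],
	incremental => 0,
};
```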

Sets

A Set tells IRStats2 how to group data points. It is based on an existing ("eprint") meta-field: each distinct value of the field becomes a set value for which IRStats2 can give you statistics. For instance, you can get download stats by author or by item type. Both "author" and "item type" are sets. Most Set definitions are straightforward to declare, with the exception of "creators" (a.k.a. "authors").

Configuration example and options

        {
                'field' => 'divisions',
                'groupings' => [ 'authors' ]
        },

This defines the Set "divisions" - if the divisions field reflects the hierarchical structure of your institution (as it should) then you can get stats per division/school/faculty. You can also get "Top publications" per division.

Here are all the options you may use when defining a Set:

name
(optional - defaults to the value of 'field') - the name of the set
field
the "eprint" field to use to generate set values
groupings
(optional - defaults to []) - an ARRAY of set names to use as groupings. A grouping within a set fills in the statement: "I want to be able to see Top Y per set". For instance, for the set 'divisions' and the grouping 'authors': "I want to be able to see Top Authors per Division".
anon
(optional - default to ) - whether to make the set values anonymous (a hex MD5 digest is used instead). This is particularly useful when using authors' IDs, which are usually their email addresses (and you don't want to make these public).
use_ids
For compound fields only (especially for creators). Tells IRStats2 to use the "id" part to generate distinct set values. This is more accurate than using the "name" part only.
id_field
For compound fields only. The name of the "id" field - usually it is just "id", as in "creators_id".
minimum_filter_length
Used by the Set Finder on the Reports. If set, the search for set values only starts after the user has entered minimum_filter_length characters. Some sets can be large (esp. creators) and we do not really want to preload potentially hundreds of thousands of author names in the UI. Instead we ask the user to search for authors' names.
render_single_value
A CODEREF that must return a DOM element. This will tell how to render a set value, if you do not wish to use the default renderers. The function will receive three variables: $repo, $setname and $setvalue.
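
To give an idea of what an anonymised set value looks like, the sketch below hashes a made-up author ID with the standard Digest::MD5 module. It only illustrates the hex MD5 form mentioned for the anon option; whether IRStats2 normalises the value before hashing is not covered here.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# hypothetical author ID, as it might be stored in creators_id
my $author_id = 'jane.doe@example.org';

# the anonymised value: 32 hexadecimal characters, always the same for a given input
my $anon_value = md5_hex( $author_id );

printf "%s => %s\n", $author_id, $anon_value;
```

Because the digest is stable, stats can still be grouped per author without the email address appearing in URLs or reports.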

Note that "eprint" is a built-in Set and should not be defined in the configuration. The "eprint" Set is the collection of all the eprints (or "publications") of your repository. It is the assumed Set when no set is declared, as for the scenario "show me the top publications [among the entire repository]".
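
Putting the options together, a Set definition for authors might look like the sketch below. The field name 'creators', the option values and the CSS class 'my_author_value' are illustrative assumptions; compare with the definition shipped in your own configuration before reusing any of it.

```perl
        {
                'name' => 'authors',
                'field' => 'creators',            # assumed name of the compound field
                'groupings' => [ 'eprint' ],      # "Top EPrints per Author"
                'anon' => 1,                      # expose MD5 digests, not email addresses
                'use_ids' => 1,                   # distinct values come from the "id" part
                'id_field' => 'id',               # as in "creators_id"
                'minimum_filter_length' => 2,     # search once 2 characters are typed
                # custom renderer: must return a DOM element
                'render_single_value' => sub {
                        my( $repo, $setname, $setvalue ) = @_;
                        my $span = $repo->make_element( 'span', class => 'my_author_value' );
                        $span->appendChild( $repo->make_text( $setvalue ) );
                        return $span;
                },
        },
```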

Reports

Reports are single pages which group different metrics together. The main report page (http://yourrepo.url/cgi/stats/report) is such an example. If you create a new report, "my_report", it will be available at the URL: http://yourrepo.url/cgi/stats/report/my_report.

In the configuration, Reports can be seen as a top-to-bottom stack of Stats::View modules. Such modules know how to draw certain stats such as graphs, tables or pie charts; they just need to be positioned on the report. The module handling the generation of reports (Screen::IRStats2::Report) takes care of passing on the correct context to each Stats::View module. Such contexts include any date filters or set values selected by a visiting user.

A basic report showing the monthly downloads graph and the top downloaded publications:

my_report => {
	items => [
		{
			plugin => 'ReportHeader'
		},
		{
			plugin => 'Google::Graph',
			datatype => 'downloads',
			options => {
				date_resolution => 'month',
				graph_type => 'column',
			},
		},
		{
			plugin => 'Table',
			datatype => 'downloads',
			options => {
				limit => 10,
				top => 'eprint',
				title_phrase => 'top_downloads'
			},
		},
	],
},

The options are detailed in the API section.

Security aspects

Users must have the following two roles to view stats:

+irstats2/view
+irstats2/export

However these two roles are given to the "public" by default, meaning that anyone can view and/or export the stats. You can comment out these lines in the configuration to prevent that behaviour.
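
As a sketch of the alternative, the roles can be granted to a single user type instead of the public. This assumes that the stock configuration grants them through $c->{public_roles} with a line similar to the commented one below, and it uses the standard EPrints $c->{user_roles} structure; check both against your own configuration files.

```perl
# Assumption: the stock IRStats2 configuration contains a line similar to this one.
# Commenting it out removes public access to the stats.
# push @{ $c->{public_roles} }, '+irstats2/view', '+irstats2/export';

# Grant the two roles to administrators only (in a local file that loads
# after the standard user roles configuration).
push @{ $c->{user_roles}->{admin} }, '+irstats2/view', '+irstats2/export';
```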

API

This section presents a few examples on how to get data out of IRStats2 for embedding data on pages or for re-use in analysis scripts (for instance).

There are two ways to get data out:

  1. From a script: this is the real API, using Perl
  2. From an Ajax request: this is usually to embed data on pages

Core concepts

Datatype

Which data to provide: IRStats2 allows the processing of any data in your repository. Its typical use is for usage statistics, so this is the main dataset, but data on deposits, open access, full text, etc. are also processed. Some repositories even include data from Scopus (citation counts).

Main datatypes:

downloads
good old download statistics - downloads of full-text documents
views
number of hits on the summary page (of a publication)
deposits
number of publications deposited
doc_access
provides 4 metrics (full_text, no_full_text, open_access and no_open_access) used for computing percentages of Open Access and Full-Text documents in the repository
doc_format
MIME type of full-texts
history
analysis of the "history" dataset - this provides information on when publications were created, edited, made live, deleted etc.
referrer
information on how site visitors got to the repository (eg. from Google, internal uni pages, etc)
search_terms
if coming from a search site (or the internal EPrints search), which words were used to get to the publication
browser
which browser visitors used on the repository

Sets

By default, IRStats2 returns data over the entire repository ie. the entire set of eprints is assumed. You can however restrict which "set" to use: the publications of an author, of a university division, of a subject, etc.

Dates and ranges

You can also restrict by dates or by a range. By default, all the stats are returned without any date restrictions.

Dates can be set as YYYYMMDD, YYYY-MM-DD or YYYY/MM/DD (e.g. 20140101, 2013-11-04). Dates are passed as a hash containing two keys, from and to. Either can be omitted to mean "from that particular date onwards" or "up to that particular date".

Ranges follow a %d%c format and the upper limit is "now" or "today", for instance:

6m
over the past 6 months
12d
over the past 12 days
3y
over the past 3 years

Only "m" (months), "d" (days) or "y" (years) may be used. You can see that 12m is the same as 1y.
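
The accepted formats can be made concrete with a small stand-alone sketch. The two helper functions are hypothetical (they are not part of IRStats2) and only illustrate the date and range syntax described above.

```perl
use strict;
use warnings;

# Normalise YYYYMMDD, YYYY-MM-DD or YYYY/MM/DD to YYYYMMDD; undef if malformed.
# The back-reference \2 insists on the same separator in both positions.
sub normalise_date
{
	my( $date ) = @_;
	return undef unless defined $date;
	return "$1$3$4" if $date =~ m{^(\d{4})([-/]?)(\d{2})\2(\d{2})$};
	return undef;
}

# Split a range such as "6m" into its count and unit; empty list if malformed.
sub parse_range
{
	my( $range ) = @_;
	return unless defined $range && $range =~ /^(\d+)([dmy])$/;
	return ( $1, $2 );
}

printf "%s\n", normalise_date( '2013-11-04' );	# prints 20131104

my( $count, $unit ) = parse_range( '6m' );
printf "%d x %s\n", $count, $unit;		# prints 6 x m
```

Note that the sketch only checks the shape of the input; it does not verify that the date exists in the calendar.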

Groupings

This tells IRStats2 how to group data and is generally only used for things like "give me the TOP eprints", "give me the TOP authors".

So having a "grouping" set to "eprint" means the top eprints; if set to "authors", the top authors, etc. The grouping must be a valid set, except when it equals "eprint".

Misc

It is possible to limit the number of records returned, when this is relevant. If you ask for the total downloads since the beginning of time, you only get one data row back (the count), so a limit makes no difference. But for queries which ask for, say, the top authors, it is useful to get only the first 10 authors: 10 here is the limit.

It is also possible to ask IRStats2 to return certain data fields in queries. For top eprints, you generally want the "eprintid" field. To draw timeline graphs (e.g. the evolution of downloads over time), you want the "datestamp" field. More examples are illustrated below.

Data from scripts

Main API

# get the IRStats2 handler, required to query IRStats2
my $handler = $repo->plugin( "Stats::Handler" );

# ask IRStats2 to show debug statements (SQL queries)
$handler->debug(1);

# Create a Context object
my $ctx = $handler->context( { datatype => "downloads" } );

# Retrieve data rows
my $data = $handler->data( $ctx )->select();

# How many rows returned:
printf "I got %d data rows back\n", $data->count;

# Get stats for divisions "uos-ecs":
$ctx->set( { set_name => 'divisions', set_value => 'uos-ecs' } );

# Get stats over the last 6 months:
$ctx->dates( { range => '6m' } );

# Get stats between 1st January 2012 and 31st March 2012:
$ctx->dates( { from => '20120101', to => '20120331' } );

# Data may be exported (see Stats/Export/ for a list of currently supported plug-ins):
my $export = $repo->plugin( "Stats::Export::CSV" );
$data->export( { export_plugin => $export } );

Full Examples

Actually, these are not really full examples. They assume you can write the beginning of a Perl script and that you have already instantiated the Stats Handler (cf. above) as $handler.
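
For completeness, the beginning of such a script could look like the sketch below. The install path /opt/eprints3 is an assumption, and the repository ID is taken from the command line.

```perl
#!/usr/bin/perl -w -I/opt/eprints3/perl_lib
# Assumption: EPrints is installed in /opt/eprints3; adjust the -I path otherwise.

use strict;
use EPrints;

# the repository ID is the first command-line argument
my $repoid = shift @ARGV;
die "usage: $0 <repository_id>\n" unless defined $repoid;

my $repo = EPrints->new->repository( $repoid );
die "could not load repository '$repoid'\n" unless defined $repo;

# the IRStats2 handler used by all the examples
my $handler = $repo->plugin( "Stats::Handler" );
die "Stats::Handler plug-in not available\n" unless defined $handler;
```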

# How many downloads in total over the entire repository

my $ctx = $handler->context( { datatype => "downloads" } );
printf "I got %d downloads\n", $handler->data( $ctx )->select->sum_all;

# How many downloads in 2013 over the entire repository

my $ctx = $handler->context( { datatype => "downloads", range => "2013" } );
printf "I got %d downloads\n", $handler->data( $ctx )->select->sum_all;

# The top 5 EPrints over the entire repository

my $ctx = $handler->context( { grouping => "eprint", datatype => "downloads" } );

my $stats = $handler->data( $ctx )->select( fields => ["eprintid"], limit => 5 );

foreach( @{ $stats->data } )
{
        printf "EPrint %d got %d downloads\n", $_->{eprintid}, $_->{count};
}

# The top 10 Subjects (let's assume LoC) for deposits (not downloads!!)

my $ctx = $handler->context( { set_name => "subjects", datatype => "deposits" } );

my $stats = $handler->data( $ctx )->select( fields => ["set_value"], limit => 10 );

my $i = 1;
foreach( @{ $stats->data } )
{
        printf "%d) %s with %d items deposited\n", $i++, $_->{set_value}, $_->{count};
}

# The top 5 downloaded EPrints for LoC Subject "D1"

my $ctx = $handler->context( { set_name => "subjects", set_value => 'D1', datatype => "downloads" } );

my $stats = $handler->data( $ctx )->select( fields => ["eprintid"], limit => 5 );

my $i = 1;
foreach( @{ $stats->data } )
{
        printf "%d) EPrint %d with %d downloads\n", $i++, $_->{eprintid}, $_->{count};
}
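
In the same spirit, a timeline query could be sketched as follows. It only combines calls shown above with the "datestamp" field mentioned under Core concepts; that each returned row carries "datestamp" and "count" keys is an assumption modelled on the previous examples.

```perl
# Downloads over time for the past 6 months
# (assumes $handler has been instantiated as shown in the Main API section)

my $ctx = $handler->context( { datatype => "downloads" } );
$ctx->dates( { range => '6m' } );

my $stats = $handler->data( $ctx )->select( fields => ["datestamp"] );

foreach my $row ( @{ $stats->data } )
{
        # assumption: one row per date stamp, with the same "count" key as above
        printf "%s: %d downloads\n", $row->{datestamp}, $row->{count};
}
```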

Embedding data

This is similar to retrieving data from scripts (cf. section above) but with a few extra options:

view
the name of the Stats::View plug-in which will draw the requested stuff (a Table? a Graph? etc.)
container_id
the DOM element "id", where the drawn stuff will be inserted on the page (if the Ajax callback is successful)

Each View plug-in also has a number of options of its own. See the examples provided below.

Graphs

The typical example is to embed the global downloads graph. This is usually the first displayed item on the IRStats2 main report page (/cgi/stats/report).

The example below inserts the downloads graph into the "mygraph" div element. Note that it uses the supplied "irstats2_googlegraph" CSS class.

Graph options:

graph_type
either "column" or "area"
show_average
either 1 or 0 - displays the average graph
date_resolution
either "year", "month" or "day" - groups data by year, month or day (be careful: selecting day may generate LOTS of data points)

<div id="mygraph" class="irstats2_googlegraph"/>

<script type="text/javascript">
document.observe("dom:loaded",function(){
         new EPJS_Stats_GoogleGraph( { 
                'context': { 'datatype': 'downloads' }, 
                'options': { 'graph_type': 'column', 'container_id': 'mygraph', 'view': 'Google::Graph', 'show_average': '1', 'date_resolution': 'month' } 
        });
});
</script>

Tables

The example below displays the top 5 downloaded eprints in the repository.

This will insert the top table into the "mytable" div element. Note that it's using the supplied "irstats2_table" CSS class.

Table options:

top
the top "thing" to display - similar to the "grouping" parameter when using scripts
limit
the max number of items to retrieve
show_count
1 or 0 - display the counts or not
show_order
1 or 0 - display the ordering (1,2,3...) or not
show_more
1 or 0 - shows the "show more" options or not (to retrieve more results)
human_display
1 or 0 - separates thousands with a comma (as done in English): 10000 becomes 10,000

<div id="mytable" class="irstats2_table"/>

<script type="text/javascript">
document.observe( "dom:loaded", function() {

        new EPJS_Stats_Table( {
                'context': { 'datatype': 'downloads' },
                'options': { 'container_id': 'mytable', 'top': 'eprint', 'view': 'Table', 'limit': '5' }   
        } );

});
</script>
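
If you post-process exported counts in a script and want the same presentation as human_display, the formatting is easy to reproduce. The function below is a hypothetical stand-alone helper using a common Perl idiom; it is not the code IRStats2 itself uses.

```perl
use strict;
use warnings;

# Insert a comma between each group of three digits, working from the right.
sub add_thousands_separator
{
	my( $number ) = @_;
	# each pass splits the last three digits off the leading run of digits
	1 while $number =~ s/^(\d+)(\d{3})/$1,$2/;
	return $number;
}

printf "%s\n", add_thousands_separator( 10000 );	# prints 10,000
```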

Misc

Graphs and Tables are the most common displays, but there are a few others which I let you explore. The JavaScript classes are in 90_irstats2.js and the associated Perl classes in Stats/View/.

GoogleSpark
similar to GoogleGraph but shows a sparkline instead (which is essentially a tiny graph).
GoogleGeoChart
country map
GooglePieChart
a pie chart
Counter
a simple counter (for instance to show the download count for your repository).

Views prefixed with "Google" are rendered by the Google Chart JavaScript library. Important note: no data is sent to Google! The data is instead drawn by the browser using SVG.

Example: Creating a processor for citation statistics

IRStats2 can evaluate any field or combination of fields in the above-mentioned datasets. For this, new datatypes and the associated Stats::Processor modules must be created.

A simple case is statistics on citation count of publications, since the datatype of the associated field is a scalar.

Citation counts can be harvested using the Citation count dataset and import plugins from Queensland University of Technology. Upon following the installation procedure, the fields scopus_impact, wos_impact and gscholar_impact are created in the EPrint table.

Modifying an existing Processor module

We now create a Processor module that evaluates the scopus_impact field and provides a scopus_citations datatype.

First, we inspect the Stats::Processor::EPrint Processor modules to find the one best suited for adaptation. A good candidate is lib/plugins/EPrints/Plugin/Stats/Processor/EPrint/DocumentAccess.pm. We copy it to archives/{repo}/cfg/plugins/EPrints/Plugin/Stats/Processor/EPrint/ScopusCitations.pm (the path must match the package name) and modify it as follows:

package EPrints::Plugin::Stats::Processor::EPrint::ScopusCitations;

our @ISA = qw/ EPrints::Plugin::Stats::Processor /;

use strict;

sub new
{
	my( $class, %params ) = @_;
	my $self = $class->SUPER::new( %params );

#  provide the name of the datatype
	$self->{provides} = [ "scopus_citations" ];

	$self->{disable} = 0;

	return $self;
}


sub process_record
{
	my ($self, $eprint ) = @_;

	my $epid = $eprint->get_id;
	return unless( defined $epid );

	my $status = $eprint->get_value( "eprint_status" );
	unless( defined $status ) 
	{
##		print STDERR "IRStats2: warning - status not set for eprint=".$eprint->get_id."\n";
		return;
	}

	return unless( $status eq 'archive' );

	my $datestamp = $eprint->get_value( "datestamp" ) || $eprint->get_value( "lastmod" );

	my $date = $self->parse_datestamp( $self->{session}, $datestamp );

	my $year = $date->{year};
	my $month = $date->{month};
	my $day = $date->{day};

# get the citation count
	my $scopus_citation_count = $eprint->get_value( "scopus_impact" );

# store the citation count per eprint id
	if (defined $scopus_citation_count)
	{
		$self->{cache}->{"$year$month$day"}->{$epid}->{scopus} = $scopus_citation_count;
	}
}

1;

Next, we need to enable the new Processor plugin in our IRStats2 configuration file. In the Bazaar config section, add the following line:

$c->{plugins}{"Stats::Processor::EPrint::ScopusCitations"}{params}{disable} = 0;

In the Reports section, we can now add a new citation report that uses the scopus_citations datatype:

        citations => {
                items => [
                { plugin => 'ReportHeader' },
                {
                        plugin => 'Grid',
                        options => {
                                items => [
                                {
                                        plugin => 'Table',
                                        datatype => 'scopus_citations',
                                        options => {
                                                limit => 10,
                                                top => 'eprint',
                                                title_phrase => 'top_scopus_citations',
                                        },
                                },]
                        },
                },
                {
                        plugin => 'Grid',
                        options => {
                                items => [
                                {
                                        plugin => 'Table',
                                        datatype => 'scopus_citations',
                                        options => {
                                                limit => 10,
                                                top => 'authors',
                                                title_phrase => 'top_scopus_citations_authors'
                                        }
                                },]
                        },
                },
                ],
                category => 'advanced',
        },

Also, the phrases in irstats2.xml must be completed accordingly.
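
Once the stats have been reprocessed, the new datatype should be queryable from a script like any built-in one, which is a quick way to check that the Processor works. The sketch below reuses the calls from the API section and assumes $handler has been instantiated as described there.

```perl
# The top 10 EPrints by Scopus citation count
# (a sanity check for the new scopus_citations datatype)

my $ctx = $handler->context( { grouping => "eprint", datatype => "scopus_citations" } );

my $stats = $handler->data( $ctx )->select( fields => ["eprintid"], limit => 10 );

foreach my $row ( @{ $stats->data } )
{
        printf "EPrint %d has %d Scopus citations\n", $row->{eprintid}, $row->{count};
}
```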


Generalization

(will follow soon: description to create a generic scalar value tracker)

Specific requirements by Scopus and Web of Science

If you have a valid API key for the Scopus and Web of Science APIs, both database producers allow you to show aggregate citation counts, provided that you link back to the record in the Scopus or Web of Science database and attribute the data to their brand. I leave it to you to figure out how this can be achieved in the Table view.

Google Scholar does not permit harvesting of citation counts.