IRStats2

From EPrints Documentation

Revision as of 09:40, 13 December 2017

What is IRStats2?

IRStats2 is a statistics framework for EPrints. It comes with a set of default tools and reports, and it can also be customised, for instance to add new metrics or data sets. It has a JavaScript API to include stats on any page you want.

IRStats2 is developed against EPrints 3.3 but it was written to also work on EPrints 3.2. Older versions of EPrints are, however, not supported.

What's new in version 1.1?

This new version includes a number of improvements to existing features such as easier deployment, faster database code, tool tips and improved browser detection, as well as a number of smaller tweaks and fixes.

It now also includes filtering to allow the blocking of web crawling robots as standard.

Changes in 1.1, since 1.0.x

Updates:

  • Feature: IP-based robot filtering and default values
  • Feature: Added an option to only show live items in the stats
  • Enhancement: Avoid using experimental Perl code (i.e. the ~~ operator)
  • Enhancement: Restructured to make EPM deployment easier
  • Enhancement: Tooltip help text for KeyFigures
  • Enhancement: Optimisation for InnoDB
  • Enhancement: CSV, JSON and XML exports are saved as files instead of opening directly in the browser
  • Enhancement: Added a check for missing libraries (Date::Calc and Geo::IP) on the Bazaar installation page. Resolves #10*
  • Bugfix: Stats::View::google::Graph loses first statistics #69
  • Bugfix: Browser identification issue #66. Browser ID should be further improved

Merged pull requests:

  • Enhancement: %IGNORE_LIST of words (stopwords) are very few and only in "en"
  • Enhancement: Add support for transactions
  • Bugfix: The title of Screen::IRStats2::Report should not change according to the report you chose
  • Bugfix: Avoid XSS vulnerability in some CGI output

Installation

Dependencies

  • Geo::IP or Geo::IP::PurePerl
  • Date::Calc

Both can usually be installed via your Linux package manager (apt-get, yum, ...) or via CPAN.

e.g. in Debian/Ubuntu:

apt-get install libgeo-ip-perl libdate-calc-perl

EPrints 3.3

IRStats2 can be installed directly via the Bazaar on EPrints 3.3.

EPrints 3.3.11 onwards

Installing IRStats2 from the Bazaar is all you need to do. It is recommended that you restart Apache after doing so.


EPrints 3.3.1 to 3.3.10

You need to install IRStats2 from the Bazaar as above, but you also need to apply a few patches to enable the Google map showing the "Origins of downloads".

The patches relate to an incompatibility between the Prototype JS library (used by EPrints) and Google Charts (used by IRStats2). The two patches you need to apply are:

EPrints 3.2.x

On EPrints 3.2 you will have to manually copy the required files to your EPrints installation path. This is a low-risk operation since IRStats2 is a true add-on to EPrints and does not interact with the core software. You may want to back up your EPrints files and your database but, again, this should not be necessary.

1. Get the files from GitHub, or download the tarball directly from https://github.com/eprints/irstats2/tarball/master [tar.gz].

2. Copy the modules and various configuration files to your local archive:

cp -R bin/* /opt/eprints3/archives//bin/
cp -R cfg/* /opt/eprints3/archives//cfg/
cp -R cgi/* /opt/eprints3/archives//cgi/

(create the bin and cgi directories if they don't exist).

3. Test everything is OK:

/opt/eprints3/bin/epadmin test

4. Add the following to the <head> section of your template files (usually located in /opt/eprints3/archives//cfg/lang/en/templates/):

<script type="text/javascript" src="http://www.google.com/jsapi">// <!-- No script --></script>
<script type="text/javascript">
        google.load("visualization", "1", {packages:["corechart", "geochart"]});
</script>

5. Restart the web server

Processing

Processing works in two steps: the initial processing and then a daily incremental processing. Because the initial processing will take care of all your legacy "download" data, this can take a (very) long time. It may take a few days if your repository is very large, although more likely it will take a few hours.

For the initial processing, run the command below as the "eprints" user (and remember this may take a long time to complete). If you are running it from an SSH session, you may want to use the "screen" Linux utility so that the process keeps running if your SSH session is interrupted.

/opt/eprints3/archives/REPO_ID/bin/stats/process_stats REPO_ID --setup --verbose

For the daily incremental processing, add the command below to cron (preceded by the schedule fields of your choice). It is a good idea to let this run overnight, when there is less traffic to your repository.

perl /opt/eprints3/archives/REPO_ID/bin/stats/process_stats REPO_ID 1>/dev/null 2>/dev/null

The two redirections to /dev/null force the process not to output anything.

When the initial processing has completed, you may point your browser to http://yourrepo.url/cgi/stats/report to look at some stats!

Configuration

This section details how to configure IRStats2 and mostly relates to the file cfg/cfg.d/z_irstats2.pl.

It is recommended that you make your changes in a separate file (e.g. zz_irstats2_local.pl, which must be loaded AFTER z_irstats2.pl) as this will make Bazaar updates easier to apply.

Datasets/Datatypes

Since IRStats2 can handle any EPrints datasets (not just the 'access' dataset which records downloads), you can declare in the configuration which EPrints datasets to process. For each EPrints dataset configured, IRStats2 will pass on the records from the Database to each processing module. This is coupled to the Stats::Processor modules and you will see that, by default, IRStats2 processes:

  • The "access" dataset with the associated Stats::Processor::Access modules
  • The "eprint" dataset with the associated Stats::Processor::EPrint modules
  • The "history" dataset with, as you have guessed, the Stats::Processor::History modules

Each module provides specific data types, which are declared in the module itself. For instance, Stats::Processor::Access::Downloads provides us with the "downloads" and "views" data types.

Configuration example and options

access => { 
	filters => [ 'Robots', 'Repeat' ], 
	incremental => 1 
}

The only two options which can be used are:

  • incremental: 1 or 0 (default 1) - tells IRStats2 whether to process the DB records incrementally (1) or to reprocess the entire dataset on every daily run (0). For downloads (i.e. the "access" dataset), only the new records need to be processed each day; there is no need to restart from 0. However, some metrics used for the "eprint" dataset require the entire dataset to be re-processed daily, which is OK as the "eprint" dataset is usually much smaller than the "access" one.
  • filters: an array of Filters (default []) - tells IRStats2 to apply filters before processing the records. This is especially useful for "access" records where hits by robots/crawlers are usually removed. Filters are very similar to Processor modules, except that they must return a boolean to indicate whether to keep or to discard the record. If the record is kept then it is passed on to the related Processor modules.

Remember that if you want to process new datasets (e.g. "user") then you must write the associated Stats::Processor modules, otherwise nothing will happen.
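The filter-then-process flow described above can be sketched in a few lines of plain Perl. This sketch does not use the IRStats2 plug-in API: the record keys (ip, eprintid, user_agent), the two filter subs and the run_pipeline helper are all hypothetical, and the choice that "true means keep" is an assumption made for the illustration only.

```perl
use strict;
use warnings;

# Hypothetical filters. Assumption for this sketch: each one returns true to
# KEEP the record and false to discard it.
my %filters = (
    Robots => sub { my( $rec ) = @_; return $rec->{user_agent} !~ /bot|crawler|spider/i; },
    Repeat => do {
        my %seen;    # remembers the (ip, eprintid) pairs already counted
        sub { my( $rec ) = @_; return !$seen{ "$rec->{ip}|$rec->{eprintid}" }++; };
    },
);

# Pass each record through the configured filters, then on to the processor.
sub run_pipeline
{
    my( $conf, $records, $processor ) = @_;

    RECORD: foreach my $rec ( @$records )
    {
        foreach my $name ( @{ $conf->{filters} || [] } )
        {
            next RECORD unless $filters{$name}->( $rec );
        }
        $processor->( $rec );
    }
}

# Same shape as the "access" configuration example above.
my $conf = { filters => [ 'Robots', 'Repeat' ], incremental => 1 };

# Made-up access records.
my @records = (
    { ip => '10.0.0.1', eprintid => 42, user_agent => 'Mozilla/5.0' },
    { ip => '10.0.0.1', eprintid => 42, user_agent => 'Mozilla/5.0' },    # repeat hit
    { ip => '10.0.0.2', eprintid => 42, user_agent => 'Googlebot/2.1' },  # robot
);

my %downloads;
run_pipeline( $conf, \@records, sub { $downloads{ $_[0]->{eprintid} }++ } );

printf "EPrint %d: %d download(s) kept\n", $_, $downloads{$_} for sort keys %downloads;
```

Of the three made-up records, only the first survives both filters, so a single download is counted for eprint 42.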

Sets

A Set tells IRStats2 how to group data points and it is done via an existing ("eprint") meta-field. Each value of that set (in essence, the distinct values of the field) will become a set value you can use in IRStats2 to give you statistics on the value. For instance, you can get download stats by author or by item type. Both "author" and "item type" are sets. Most Set definitions are straight-forward to declare, with the exception of "creators" (a.k.a. "authors").

Configuration example and options

{
	'field' => 'divisions',
	'groupings' => [ 'authors' ]
},

This defines the Set "divisions" - if the divisions field reflects the hierarchical structure of your institution (as it should) then you can get stats per division/school/faculty. You can also get "Top publications" per division.

Here are all the options you may use when defining a Set:

  • name: (optional - default to 'field') - the name of the set
  • field: the "eprint" field to use to generate set values
  • groupings: (optional - default to []) - an ARRAY of set names to use as groupings. A new grouping, within a set, fills in the statement: "I want to be able to see Top Y per set". For instance for the set 'divisions' and the grouping 'authors': "I want to be able to see Top Authors per Divisions".
  • anon: (optional - default to ) - whether to make the set values anonymous (a hex MD5 digest is used instead). This is particularly useful when using authors' IDs, which are usually their email addresses (and you don't want to make these public).
  • use_ids: For compound fields only (especially for creators). Tells IRStats2 to use the "id" part to generate distinct set values. This is more accurate than using the "name" part only.
  • id_field: For compound fields only. The name of the "id" field - usually it is just "id", as in "creators_id".
  • minimum_filter_length: Used by the Set Finder on the Reports. If set, the search for set values only starts after the user has entered minimum_filter_length characters. Some sets can be large (esp. creators) and we do not really want to preload potentially hundreds of thousands of author names on the UI. Instead we ask the user to search for authors' names.
  • render_single_value: A CODEREF that must return a DOM element. This will tell how to render a set value, if you do not wish to use the default renderers. The function will receive three variables: $repo, $setname and $setvalue.

Note that "eprint" is a built-in Set and should not be defined in the configuration. The "eprint" Set is the collection of all the eprints (or "publications") of your repository. It is the assumed Set when no set is declared, as for the scenario "show me the top publications [among the entire repository]".
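As an illustration of the options listed above, here is a sketch of a Set definition for authors, written as a standalone Perl hash. The option names come from the list above; the values, the $authors_set variable and the example author id are made up, and where exactly the definition goes in the configuration file is not shown here. The last lines only illustrate what "anon" amounts to conceptually (a hex MD5 digest in place of the raw value); the exact string IRStats2 hashes is an assumption.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# A Set definition using the options listed above (illustrative values).
my $authors_set = {
    name                  => 'authors',     # set name (would otherwise default to the field)
    field                 => 'creators',    # compound "eprint" field
    use_ids               => 1,             # tell authors apart by the "id" part
    id_field              => 'id',          # i.e. "creators_id"
    anon                  => 1,             # expose a digest instead of the raw id
    minimum_filter_length => 2,             # Set Finder searches after 2 characters
};

# Conceptual effect of "anon": the raw id is replaced by a hex MD5 digest.
my $raw_id  = 'jane.doe@example.org';       # made-up author id
my $anon_id = $authors_set->{anon} ? md5_hex( $raw_id ) : $raw_id;

print "$raw_id is exposed as $anon_id\n";
```

The digest is always 32 hexadecimal characters, so the email address itself never appears in public URLs or exports.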

Reports

Reports are single pages which group different metrics together. The main report page (http://yourrepo.url/cgi/stats/report) is such an example. If you create a new report, "my_report", it will be available at the URL: http://yourrepo.url/cgi/stats/report/my_report.

In the configuration, Reports can be seen as a top-to-bottom stack of Stats::View modules. Such modules know how to draw certain stats such as graphs, tables or pie charts; they just need to be positioned on the report. The module handling the generation of reports (Screen::IRStats2::Report) takes care of passing on the correct context to each Stats::View module. Such contexts include any date filters or set values selected by a visiting user.

# A basic report showing the monthly downloads graph and the top downloaded publications:
my_report => {
	items => [
		{
			plugin => 'ReportHeader'
		},
		{
			plugin => 'Google::Graph',
			datatype => 'downloads',
			options => {
				date_resolution => 'month',
				graph_type => 'column',
			},
		},
		{
			plugin => 'Table',
			datatype => 'downloads',
			options => {
				limit => 10,
				top => 'eprint',
				title_phrase => 'top_downloads'
			},
		},
	],
},

The options are detailed in the API section.

Security

Users must have the following two roles to view stats:

  • +irstats2/view
  • +irstats2/export

However these two roles are given to the "public" by default, meaning that anyone can view and/or export the stats. These lines may be commented out in the configuration to prevent this behaviour.

API

This section presents a few examples on how to get data out of IRStats2 for embedding data on pages or for re-use in analysis scripts (for instance).

There are two ways to get data out:

  • From a script: this is the real API, using Perl
  • From an Ajax request: this is usually to embed data on pages

Core concepts

Datatype

A datatype identifies which data to return. IRStats2 is able to process any kind of data in your repository. Its typical use is usage statistics, so that is the main dataset, but data on deposits, open access, full text, etc. are also processed. Some repositories even include data from Scopus (citation counts).

Main datatypes:

  • downloads: good old download statistics - downloads of full-text documents
  • views: number of hits on the summary page (of a publication)
  • deposits: number of publications deposited
  • doc_access: provides 4 metrics (full_text, no_full_text, open_access and no_open_access) used for computing percentages of Open Access and Full-Text documents in the repository
  • doc_format: MIME type of full-texts
  • history: analysis of the "history" dataset - this provides information on when publications were created, edited, made live, deleted etc.
  • referrer: information on how site visitors got to the repository (eg. from Google, internal uni pages, etc)
  • search_terms: if coming from a search site (or the internal EPrints search), which words were used to get to the publication
  • browser: which browser visitors used on the repository

Sets

By default, IRStats2 returns data over the entire repository, i.e. the entire set of eprints is assumed. You can however restrict which "set" to use: the publications of an author, of a university division, of a subject, etc.

Dates and ranges

You can also restrict by dates or by a range. By default, all the stats are returned without any date restrictions.

Dates can be given as YYYYMMDD, YYYY-MM-DD or YYYY/MM/DD (e.g. 20170101, 2017-11-04). Dates are passed as a hash containing two keys, from and to; either may be omitted to mean "from that date onwards" or "up to that date".

Ranges follow a %d%c format (a number followed by a unit letter) and the upper limit is always "now" (today), for instance:

  • 6m: over the past 6 months
  • 12d: over the past 12 days
  • 3y: over the past 3 years

Only "m" (months), "d" (days) or "y" (years) may be used. 12m is the same as 1y.
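The formats described above can be checked with a couple of small helpers. This is a minimal sketch, not IRStats2's own parsing code: normalise_date and parse_range are hypothetical names, they cover only the date formats and the %d%c ranges described in this section, and they do not validate that a date actually exists in the calendar.

```perl
use strict;
use warnings;

# Normalise YYYYMMDD, YYYY-MM-DD or YYYY/MM/DD to YYYYMMDD; undef if not a date.
sub normalise_date
{
    my( $date ) = @_;
    return undef unless defined $date;
    return "$1$2$3" if $date =~ m!^(\d{4})[-/]?(\d{2})[-/]?(\d{2})$!;
    return undef;
}

# Split a range such as "6m" into ( amount, unit ); empty list if malformed.
sub parse_range
{
    my( $range ) = @_;
    my %units = ( d => 'days', m => 'months', y => 'years' );
    return ( $1, $units{$2} ) if defined $range && $range =~ /^(\d+)([dmy])$/;
    return;
}

print normalise_date( '2017-11-04' ), "\n";      # prints 20171104
print join( ' ', parse_range( '6m' ) ), "\n";    # prints 6 months

my @bad = parse_range( '2w' );                   # weeks are not a valid unit
print scalar( @bad ), "\n";                      # prints 0
```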

Groupings

This tells IRStats2 how to group data and is generally only used for things like "give me the TOP eprints", "give me the TOP authors".

So having a "grouping" set to "eprint" means the top eprints; if set to "authors", the top authors, etc. The grouping must be a valid set name, with the exception of "eprint", which is built in.

Misc

It is possible to limit the number of records returned, where this is relevant. If you ask for the total downloads since the beginning of time, you only get one data row back (that count), so a limit makes no difference. But for queries which ask for, say, the top authors, it is useful to be able to get only the first 10 authors; 10 here is the limit.

It is also possible to ask IRStats2 to return certain data fields in queries. For top eprints, you generally want the "eprintid" field. To draw timeline graphs (e.g. evolution of downloads over time), you'd want the "datestamp" field. More examples are illustrated below.
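To make the ideas of grouping and limit concrete, the sketch below computes a "top N" over in-memory rows in plain Perl. It is only a conceptual illustration with made-up data: the top_n helper is hypothetical and is not part of IRStats2, which works on its own database tables rather than on Perl arrays.

```perl
use strict;
use warnings;

# Made-up data rows: one row per (author, eprint) with a download count.
my @rows = (
    { author => 'Smith', eprintid => 1, count => 120 },
    { author => 'Jones', eprintid => 2, count => 300 },
    { author => 'Smith', eprintid => 3, count => 250 },
    { author => 'Patel', eprintid => 4, count => 90 },
);

# Sum the counts per value of the grouping key, sort by descending total,
# and keep at most $limit rows.
sub top_n
{
    my( $rows, $grouping, $limit ) = @_;

    my %total;
    $total{ $_->{$grouping} } += $_->{count} for @$rows;

    my @sorted = sort { $total{$b} <=> $total{$a} || $a cmp $b } keys %total;
    splice( @sorted, $limit ) if defined $limit && @sorted > $limit;

    return map { +{ $grouping => $_, count => $total{$_} } } @sorted;
}

# "Top 2 authors": grouping is "author", limit is 2.
printf "%s: %d\n", $_->{author}, $_->{count} for top_n( \@rows, 'author', 2 );
```

With the rows above this prints "Smith: 370" then "Jones: 300"; without a grouping there would be a single total (760), which is why a limit only matters for "top" queries.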

Data from scripts

Main API

# get the IRStats2 handler, required to query IRStats2
my $handler = $repo->plugin( "Stats::Handler" );

# ask IRStats2 to show debug statements (SQL queries)
$handler->debug(1);

# Create a Context object
my $ctx = $handler->context( { datatype => "downloads" } );

# Retrieve data rows
my $data = $handler->data( $ctx )->select();

# How many rows returned:
printf "I got %d data rows back\n", $data->count;

# Get stats for divisions "uos-ecs":
$ctx->set( { set_name => 'divisions', set_value => 'uos-ecs' } );

# Get stats over the last 6 months:
$ctx->dates( { range => '6m' } );

# Get stats between 1st January 2012 and 31st March 2012:
$ctx->dates( { from => '20120101', to => '20120331' } );

# Data may be exported (see Stats/Export/ for a list of currently supported plug-ins):
my $export = $repo->plugin( "Stats::Export::CSV" );
$data->export( { export_plugin => $export } );

Full Examples

These are not really full examples: they assume you can write the beginning of a Perl script and that you have already instantiated the Stats Handler (cf. above) as $handler. Each example below stands on its own.

# How many downloads in total over the entire repository
my $ctx = $handler->context( { datatype => "downloads" } );
printf "I got %d downloads\n", $handler->data( $ctx )->select->sum_all;

# How many downloads in 2013 over the entire repository
my $ctx = $handler->context( { datatype => "downloads", range => "2013" } );
printf "I got %d downloads\n", $handler->data( $ctx )->select->sum_all;

# The top 5 EPrints over the entire repository
my $ctx = $handler->context( { grouping => "eprint", datatype => "downloads" } );

my $stats = $handler->data( $ctx )->select( fields => ["eprintid"], limit => 5 );

foreach( @{ $stats->data } )
{
        printf "EPrint %d got %d downloads\n", $_->{eprintid}, $_->{count};
}

# The top 10 Subjects (let's assume LoC) for deposits (not downloads!!)
my $ctx = $handler->context( { set_name => "subjects", datatype => "deposits" } );

my $stats = $handler->data( $ctx )->select( fields => ["set_value"], limit => 10 );

my $i = 1;
foreach( @{ $stats->data } )
{
        printf "%d) %s with %d items deposited\n", $i++, $_->{set_value}, $_->{count};
}

# The top 5 downloaded EPrints for LoC Subject "D1"
my $ctx = $handler->context( { set_name => "subjects", set_value => 'D1', datatype => "downloads" } );

my $stats = $handler->data( $ctx )->select( fields => ["eprintid"], limit => 5 );

my $i = 1;
foreach( @{ $stats->data } )
{
        printf "%d) EPrint %d with %d downloads\n", $i++, $_->{eprintid}, $_->{count};
}

Embedding data

This is similar to retrieving data from scripts (cf. section above) but with a few extra options:

  • "view": the name of the Stats::View plug-in which will draw the requested stuff (a Table? a Graph? etc.)
  • "container_id": the DOM element "id", where the drawn stuff will be inserted on the page (if the Ajax callback is successful)

Each View plug-in also has a number of options of its own. See the examples provided below.

Graphs

The typical example is to embed the global downloads graph. This is usually the first displayed item on the IRStats2 main report page (/cgi/stats/report).

<!--
This will basically insert the downloads graph into the "mygraph" div element. Note that it's using the supplied
     "irstats2_googlegraph" CSS class.
      
      Graph options:
      - graph_type: either "column" or "area"
      - show_average: either 1 or 0 - displays the average graph
      - date_resolution: either "year", "month" or "day" - groups data by year, month or day (be careful: selecting day may generate LOTS of data points)
-->

<div id="mygraph" class="irstats2_googlegraph"></div>

<script type="text/javascript">
document.observe("dom:loaded",function(){
         new EPJS_Stats_GoogleGraph( { 
                'context': { 'datatype': 'downloads' }, 
                'options': { 'graph_type': 'column', 'container_id': 'mygraph', 'view': 'Google::Graph', 'show_average': '1', 'date_resolution': 'month' } 
        });
});
</script>

Tables

The example below displays the top 5 downloaded eprints in the repository.

<!--
This will insert the top table into the "mytable" div element. Note that it's using the supplied
     "irstats2_table" CSS class.
      
      Table options:
      - top: the top "thing" to display - similar to the "grouping" parameter when using scripts
      - limit: the max number of items to retrieve
      - show_count: 1 or 0 - display the counts or not
      - show_order: 1 or 0 - display the ordering (1,2,3...) or not
      - show_more: 1 or 0 - shows the "show more" options or not (to retrieve more results)
      - human_display: 1 or 0 - separate 1000 with a comma (as done in English): 10000 becomes 10,000
-->
<div id="mytable" class="irstats2_table"></div>

<script type="text/javascript">
document.observe( "dom:loaded", function() {

        new EPJS_Stats_Table( {
                'context': { 'datatype': 'downloads' },
                'options': { 'container_id': 'mytable', 'top': 'eprint', 'view': 'Table', 'limit': '5' }   
        } );

});
</script>

Misc

Graphs and Tables are the most common displays, but there are a few other ones to explore. The JavaScript classes are in 90_irstats2.js and the associated Perl classes are in Stats/View/.

  • GoogleSpark: similar to GoogleGraph but shows a sparkline instead (which is essentially a tiny graph).
  • GoogleGeoChart: country map
  • GooglePieChart: a pie chart
  • Counter: a simple counter (for instance to show the download count for your repository).

Views prefixed by "Google" are rendered by the Google Charts JavaScript library. Important note: no data is sent to Google! The data is instead drawn by the browser using SVG.