IRStats2

From EPrints Documentation
Revision as of 12:29, 16 November 2017 by Rwf1v07@soton.ac.uk (talk | contribs)

What is IRStats2?

IRStats2 is a statistics framework for EPrints. It comes with some cool default tools and reports, and it can also be customised, for instance to add new metrics or data sets. It has a JavaScript API for including stats on any page you want.

IRStats2 is developed against EPrints 3.3 but it was written to also work on EPrints 3.2. Older versions of EPrints are, however, not supported.

What's new in version 1.1?

This new version includes a number of improvements to existing features such as easier deployment, faster database code, tool tips and improved browser detection, as well as a number of smaller tweaks and fixes.

It now also includes filtering to allow the blocking of web crawling robots as standard.

Changes in 1.1, since 1.0.x

Updates:

  • Feature: IP based robot filtering and default values
  • Feature: Adding an option to only show live items in the stats
  • Enhancement: Avoid using experimental Perl code (i.e. ~~)
  • Enhancement: restructure to make epm deployment easier
  • Enhancement: tooltip help text for KeyFigures
  • Enhancement: Optimisation for InnoDB
  • Enhancement: CSV, JSON and XML exports are saved as files instead of opening directly in the browser
  • Enhancement: Added missing libraries check (Date::Calc and Geo::IP) on the Bazaar installation page. resolves #10*
  • Bugfix: Stats::View::google::Graph loses first statistics #69
  • Bugfix for Browser identification issue #66. Browser ID should be further improved

Merged pull requests:

  • Enhancement: %IGNORE_LIST of words (stopwords) are very few and only in "en"
  • Enhancement: Add support for transactions
  • Bugfix: The title of Screen::IRStats2::Report should not change according to report you chose
  • Bugfix: Avoid XSS vulnerability in some CGI output

Installation

Dependencies

IRStats2 depends on the Perl libraries Date::Calc and Geo::IP. Both can usually be installed via your Linux package manager (apt-get, yum, ...) or via CPAN.

EPrints 3.3

IRStats2 can be installed directly via the Bazaar on EPrints 3.3.

EPrints 3.3.11 onwards

Installing IRStats2 from the Bazaar is all you need to do. It is recommended that you restart Apache after doing so.


EPrints 3.3.1 to 3.3.10

You need to install IRStats2 from the Bazaar as above, but you also need to apply a few patches to enable the Google map showing the "Origins of downloads".

The patches relate to an incompatibility between the Prototype JS library (used by EPrints) and Google Charts (used by IRStats2). The two patches you need to apply are:

EPrints 3.2.x

On EPrints 3.2 you will have to manually copy the required files to your EPrints installation path. It is a low-risk operation since IRStats2 is a true add-on to EPrints and it does not interact with the core software. You may want to back up your EPrints files and your database but, again, this should not be necessary.

1. Get the files from GitHub, or download the tar.gz archive from https://github.com/eprints/irstats2/tarball/master

2. Copy the modules and various configuration files to your local archive:

cp bin/* /opt/eprints3/archives//bin/
cp cfg/* /opt/eprints3/archives//cfg/
cp cgi/* /opt/eprints3/archives//cgi/

(create the bin and cgi directories if they don't exist).

3. Test everything is OK:

/opt/eprints3/bin/epadmin test

4. Add in the <head> sections of your template files (usually located in /opt/eprints3/archives//cfg/lang/en/templates/) the following:

<script type="text/javascript" src="http://www.google.com/jsapi">// <!-- No script --></script>
<script type="text/javascript">
        google.load("visualization", "1", {packages:["corechart", "geochart"]});
</script>

5. Restart the web server

Processing

Processing works in two steps: the initial processing and then a daily incremental processing. Because the initial processing will take care of all your legacy "download" data, this can take a (very) long time. It may take a few days if your repository is very large, although more likely it will take a few hours.

For the initial processing, run the command below as the "eprints" user (and remember this may take a long time to complete). If you are running it from an SSH session, you may want to use the "screen" Linux utility so that the process keeps running if your SSH session is dropped.

/opt/eprints3/archives/REPO_ID/bin/stats/process_stats REPO_ID --setup --verbose

For the daily incremental processing, add the line below to cron. It is a good idea to let this run overnight when there is less traffic to your repository.

perl /opt/eprints3/archives/REPO_ID/bin/stats/process_stats REPO_ID 1>/dev/null 2>/dev/null

The two redirections to /dev/null force the process to produce no output.

When the initial processing has completed, you may point your browser to http://yourrepo.url/cgi/stats/report to look at some stats!

Configuration

This section details how to configure IRStats2 and mostly relates to the file cfg/cfg.d/z_irstats2.pl.

It is recommended that you make your changes in a separate file (e.g. zz_irstats2_local.pl, which must be loaded AFTER z_irstats2.pl) as this will make Bazaar updates easier to apply.
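As a rough sketch of such a local file, the fragment below overrides one setting from a file whose name sorts after z_irstats2.pl. The top-level key $c->{irstats2}->{datasets} is an assumption about how the stock file stores its settings, so check it against your copy of z_irstats2.pl; the "access" entry itself is the one shown under Datasets/Datatypes.

```perl
# cfg/cfg.d/zz_irstats2_local.pl
# The "zz_" prefix makes this file load after z_irstats2.pl.
# $c is supplied by EPrints when it loads the files in cfg.d.

# ASSUMPTION: dataset settings live under $c->{irstats2}->{datasets};
# verify the key against your copy of z_irstats2.pl.
$c->{irstats2}->{datasets}->{access} = {
    filters     => [ 'Robots', 'Repeat' ],   # filters applied before processing
    incremental => 1,                        # only process the new records each day
};
```

Keeping overrides in their own file means a Bazaar update can replace z_irstats2.pl without touching your local values.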

Datasets/Datatypes

Since IRStats2 can handle any EPrints datasets (not just the 'access' dataset which records downloads), you can declare in the configuration which EPrints datasets to process. For each EPrints dataset configured, IRStats2 will pass on the records from the Database to each processing module. This is coupled to the Stats::Processor modules and you will see that, by default, IRStats2 processes:

  • The "access" dataset with the associated Stats::Processor::Access modules
  • The "eprint" dataset with the associated Stats::Processor::EPrint modules
  • The "history" dataset with, as you have guessed, the Stats::Processor::History modules

Each module provides specific data types, which are declared in the module itself. For instance, Stats::Processor::Access::Downloads provides us with the "downloads" and "views" data types.

Configuration example and options

access => { 
	filters => [ 'Robots', 'Repeat' ], 
	incremental => 1 
}

The only two options which can be used are:

  • incremental: 1 or 0 (default 1) - tells IRStats2 whether to process the DB records incrementally. IRStats2 data must be processed daily: with incremental set to 1 only the new records are processed each day, and with 0 the entire dataset is reprocessed every day. For downloads (i.e. the "access" dataset), you only need to process the daily downloads; there is no need to restart from 0. However, some metrics used for the "eprint" dataset require the entire dataset to be reprocessed daily, which is OK as the "eprint" dataset is usually much smaller than the "access" one.
  • filters: an array of Filters (default []) - tells IRStats2 to apply filters before processing the records. This is especially useful for "access" records where hits by robots/crawlers are usually removed. Filters are very similar to Processor modules, except that they must return a boolean to indicate whether to keep or to discard the record. If the record is kept then it is passed on to the related Processor modules.
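The keep-or-discard behaviour of filters can be sketched in a few lines of plain Perl. Everything here is hypothetical and only illustrates the flow: the record fields (eprintid, user_agent), the crawler regex and the two subroutines are made up for the example and are not the actual IRStats2 modules.

```perl
use strict;
use warnings;

# Hypothetical filter: return 1 to keep a record, 0 to discard it.
# The real Robots filter has its own rules; this regex is only an illustration.
sub filter_robots {
    my ($record) = @_;
    return ( $record->{user_agent} =~ /bot|crawler|spider/i ) ? 0 : 1;
}

# Hypothetical processor: count one download per kept record, per eprint.
my %downloads;
sub process_download {
    my ($record) = @_;
    $downloads{ $record->{eprintid} }++;
}

my @filters    = ( \&filter_robots );
my @processors = ( \&process_download );

# Made-up access records.
my @records = (
    { eprintid => 1, user_agent => 'Mozilla/5.0' },
    { eprintid => 1, user_agent => 'Googlebot/2.1' },
    { eprintid => 2, user_agent => 'Mozilla/5.0' },
);

RECORD: for my $record (@records) {
    for my $filter (@filters) {
        next RECORD unless $filter->($record);    # a false value discards the record
    }
    $_->($record) for @processors;                # kept records reach every processor
}

printf "eprint %d: %d download(s)\n", $_, $downloads{$_}
    for sort { $a <=> $b } keys %downloads;
```

The crawler hit on eprint 1 is dropped by the filter, so each eprint ends up with a single counted download.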

Remember that if you want to process new datasets (e.g. "user") then you must write the associated Stats::Processor modules, otherwise nothing will happen.

Sets

A Set tells IRStats2 how to group data points, based on an existing "eprint" meta-field. Each distinct value of that field becomes a set value you can use in IRStats2 to get statistics for that value. For instance, you can get download stats by author or by item type. Both "author" and "item type" are sets. Most Set definitions are straightforward to declare, with the exception of "creators" (a.k.a. "authors").

Configuration example and options

{
	'field' => 'divisions',
	'groupings' => [ 'authors' ]
},

This defines the Set "divisions" - if the divisions field reflects the hierarchical structure of your institution (as it should) then you can get stats per division/school/faculty. You can also get "Top publications" per division.

Here are all the options you may use when defining a Set:

  • name: (optional - default to 'field') - the name of the set
  • field: the "eprint" field to use to generate set values
  • groupings: (optional - default to []) - an ARRAY of set names to use as groupings. A new grouping, within a set, fills in the statement: "I want to be able to see Top Y per set". For instance for the set 'divisions' and the grouping 'authors': "I want to be able to see Top Authors per Divisions".
  • anon: (optional - default to ) - whether to make the set values anonymous (and hex MD5 is used instead). This is particularly useful when using authors' ID which is usually their email address (and you don't want to make these public).
  • use_ids: For compound fields only (especially for creators). Tells IRStats2 to use the "id" part to generate distinct set values. This is more accurate than using the "name" part only.
  • id_field: For compound fields only. The name of the "id" field - usually it is just "id", as in "creators_id".
  • minimum_filter_length: Used by the Set Finder on the Reports. If set, the finder only starts searching for set values after the user has entered minimum_filter_length characters. Some sets can be large (esp. creators) and we do not really want to preload potentially hundreds of thousands of author names on the UI. Instead we ask the user to search for authors' names.
  • render_single_value: A CODEREF that must return a DOM element. This will tell how to render a set value, if you do not wish to use the default renderers. The function will receive three variables: $repo, $setname and $setvalue.
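To give a feel for what the anon option produces, here is a minimal sketch that turns an author ID into a hex MD5 digest using Perl's core Digest::MD5 module. The email address is made up, and whether IRStats2 normalises the value in any way before hashing is not stated here, so treat this purely as an illustration of the output format.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical author ID (an email address) that should not be made public.
my $author_id = 'a.author@example.org';

# md5_hex returns a 32-character lowercase hexadecimal digest of its input.
my $anon_value = md5_hex($author_id);

print "$anon_value\n";
```

The same ID always gives the same digest, so stats can still be grouped per author without exposing the address.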

Note that "eprint" is a built-in Set and should not be defined in the configuration. The "eprint" Set is the collection of all the eprints (or "publications") of your repository. It is the assumed Set when no set is declared, as for the scenario "show me the top publications [among the entire repository]".
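Putting several of the options above together, a Set definition for authors might look like the sketch below. The values are illustrative assumptions (in particular the name 'authors' and the minimum_filter_length of 2), not the shipped defaults.

```perl
# One entry in the list of Set definitions (illustrative values only).
{
    name                  => 'authors',    # set name; defaults to the field name if omitted
    field                 => 'creators',   # compound "eprint" field providing the set values
    use_ids               => 1,            # build distinct values from the "id" part
    id_field              => 'id',         # i.e. the "creators_id" sub-field
    anon                  => 1,            # use hex MD5 values instead of the raw IDs
    minimum_filter_length => 2,            # Set Finder starts searching after 2 characters
},
```

With a Set named 'authors' defined this way, other Sets can list 'authors' in their groupings, as the 'divisions' example does.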

Reports

Reports are single pages which group different metrics together. The main report page (http://yourrepo.url/cgi/stats/report) is such an example. If you create a new report, "my_report", it will be available at the URL: http://yourrepo.url/cgi/stats/report/my_report.

In the configuration, Reports can be seen as a top-to-bottom stack of Stats::View modules. Such modules know how to draw certain stats such as graphs, tables or pie charts; they just need to be positioned on the report. The module handling the generation of reports (Screen::IRStats2::Report) takes care of passing on the correct context to each Stats::View module. Such contexts include any date filters or set values selected by a visiting user.

# A basic report showing the monthly downloads graph and the top downloaded publications:
my_report => {
	items => [
		{
			plugin => 'ReportHeader'
		},
		{
			plugin => 'Google::Graph',
			datatype => 'downloads',
			options => {
				date_resolution => 'month',
				graph_type => 'column',
			},
		},
		{
			plugin => 'Table',
			datatype => 'downloads',
			options => {
				limit => 10,
				top => 'eprint',
				title_phrase => 'top_downloads'
			},
		},
	],
},

The options are detailed in the API section.
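If you keep local changes in a separate file, a report could be registered there along the following lines. This is only a sketch: the key $c->{irstats2}->{report} is an assumption about where the stock configuration stores its reports, so check it against z_irstats2.pl. The plugin names and options are the ones used in the example above.

```perl
# cfg/cfg.d/zz_irstats2_local.pl (loaded after z_irstats2.pl)
# $c is supplied by EPrints when it loads the files in cfg.d.

# ASSUMPTION: reports are stored under $c->{irstats2}->{report}, keyed by report name.
$c->{irstats2}->{report}->{my_report} = {
    items => [
        { plugin => 'ReportHeader' },
        {
            plugin   => 'Table',
            datatype => 'downloads',
            options  => {
                limit        => 10,
                top          => 'eprint',
                title_phrase => 'top_downloads',
            },
        },
    ],
};
```

Because the items are drawn top to bottom, reordering the entries in the items list reorders the report.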

Security

Users must have the following two roles to view stats:

  • +irstats2/view
  • +irstats2/export

However, these two roles are given to the "public" by default, meaning that anyone can view and/or export the stats. The lines granting them may be commented out in the configuration to prevent this behaviour.
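As an alternative to commenting lines out in z_irstats2.pl, the roles could be withdrawn from a local file loaded afterwards. The sketch below assumes that public roles are held in the list $c->{public_roles} and per-user-type roles in $c->{user_roles}; both key names, and the "admin" user type, are assumptions to verify against your own configuration.

```perl
# cfg/cfg.d/zz_irstats2_local.pl (loaded after z_irstats2.pl)
# $c is supplied by EPrints when it loads the files in cfg.d.

# ASSUMPTION: public roles are held in the list $c->{public_roles}.
# Drop the two IRStats2 roles from the public list, keeping everything else.
$c->{public_roles} = [
    grep { $_ ne '+irstats2/view' && $_ ne '+irstats2/export' }
        @{ $c->{public_roles} || [] }
];

# ASSUMPTION: per-type roles are held in $c->{user_roles}->{TYPE}; "admin" is an example.
# Grant the two roles to that user type instead.
push @{ $c->{user_roles}->{admin} }, '+irstats2/view', '+irstats2/export';
```

After a change like this, restart the web server and check that the stats pages are no longer visible when logged out.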

API