Dataset Manipulation, Triggers and Events

From EPrints Documentation
Revision as of 16:45, 3 February 2014 by Af05v@ecs.soton.ac.uk (Talk | contribs)

This exercise combines several EPrints capabilities: adding to datasets, events and triggers.

The overall aim is to add a Trigger which queues an Event on the indexer. The Event scans a file for its mime type (according to the unix file command) and stores the result in the file dataset.

We shall also add a new dataset although we won't specifically be using it.

Adding To Datasets

This is done with a config file under cfg/cfg.d/ which we can call package_name.pl for example.

The following shows an example of adding the file_cmd_mime field to the file dataset.

 $c->add_dataset_field( "file", {
       name => "file_cmd_mime",
       type => "text",
 }, reuse => 1 );

Note that the reuse flag says whether this field may be reused if it already exists. If it is set to 0 (or not defined) and the field already exists, the package install will fail.

Note also that if none of the config files define the field, it is assumed to be no longer required and is removed (data included). Some investigation is needed to see whether upgrades work.

Adding New Datasets

Adding a new dataset is much the same as adding to an existing one, except that we also need to define the dataset and its basic class wrapper.

 # Enable the event
 $c->{plugins}{"Event::ScanFile"}{params}{disable} = 0;
 
 # Define the dataset
 $c->{datasets}->{package_dataset} = {
      class => "EPrints::DataObj::PackageDataset",
      sqlname => "package_dataset",
      datestamp => "datestamp",       
      sql_counter => "datasetid",
 };
 
 # Add fields to the dataset
 $c->add_dataset_field( "package_dataset", { name=>"datasetid", type=>"counter", required=>1, can_clone=>0, sql_counter=>"datasetid" }, );
 $c->add_dataset_field( "package_dataset", { name=>"name", type=>"text", required=>0, }, );
 $c->add_dataset_field( "package_dataset", { name=>"count", type=>"int", required=>0, }, );
 
 # Define the class. This can be done either in a new file in the right place or with this
 # override trick: open a '{' and then continue as if this were a new file
 {
   package EPrints::DataObj::PackageDataset;
 
   our @ISA = qw( EPrints::DataObj );
   
   # The new method can simply call the constructor of the super class (EPrints::DataObj)
   sub new
   {
       return shift->SUPER::new( @_ );
   }
 
   # This method is required and just returns the dataset id.
   sub get_dataset_id
   {
       my ($self) = @_;
       return "package_dataset";
   }
 
 }

Events

Events are jobs which the indexer runs at various times. Because we don't want to wait for our mime type scan to complete, and don't mind when it completes, we may as well make it an event which can run at a convenient time.

Other examples of events are:

  • Thumbnail Generation
  • Full text indexing
  • RDF generation

These are all things which can be queued up so as not to slow the deposit process.

The last thing to note about events is that the indexer obeys the eprint edit-lock, so if someone has the resource locked, the events won't happen yet.

The indexer tries to execute queued events every 30 seconds. You can view the status of events and the indexer via the "status" button under the "System Tools" tab of the admin interface.
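
The way locked resources hold events back can be pictured with a toy model in plain Perl. This is only a sketch and not EPrints code: the queue structure, the lock table and the run_pass function are all hypothetical names invented for illustration.

```perl
use strict;
use warnings;

# Toy model only: a queue of events, each naming the record it works on
my @queue = (
    { action => "scanfile", record => 1 },
    { action => "scanfile", record => 2 },
);

# Hypothetical lock table: records which somebody is currently editing
my %locked = ( 2 => 1 );

# One pass of the toy indexer: run the events whose record is unlocked
# and keep the others in the queue for a later pass
sub run_pass
{
    my( $queue, $locked ) = @_;

    my( @done, @waiting );
    for my $event ( @$queue )
    {
        if( $locked->{ $event->{record} } )
        {
            push @waiting, $event;
        }
        else
        {
            push @done, $event;
        }
    }
    @$queue = @waiting;

    return scalar @done;
}

# First pass: only the event for record 1 runs, record 2 stays queued
my $ran = run_pass( \@queue, \%locked );
```

Once the lock on record 2 is released, the next pass picks up the remaining event.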

ScanFile event

An Event is just another type of plug-in, so you create it in the archive's cfg/plugins/EPrints/Plugin/Event/ folder.

Below is an event with a single sub which performs the needed operation. All that needs to be passed to it is a file_id.

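 package EPrints::Plugin::Event::ScanFile;
 
 use strict;
 
 our @ISA = qw( EPrints::Plugin::Event );
 
 # Scan the file with the given id and store its mime type in file_cmd_mime
 sub scanfile
 {
       my( $self, $file_id ) = @_;
 
       my $repository = $self->{repository};
 
       my $file = EPrints::DataObj::File->new( $repository, $file_id );
 
       # Get a path to a local copy of the file which the file command can read
       my $src_path = $file->get_local_copy;
 
       # file -i prints "<path>: <mime type>; charset=<charset>" and awk keeps the
       # second field, so this assumes the path contains no spaces
       my $cmd = "file -i $src_path | awk '{split (\$0, a, \" \"); print a[2]}'";
 
       my $ret = `$cmd`;
 
       # Strip the line ending and the ";" which follows the mime type
       $ret =~ s/;?\r?\n//;
 
       if( defined $ret and $ret ne "" ) {
               $file->set_value( "file_cmd_mime", $ret );
               $file->commit();
       }
 
 }
 
 1;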
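
The mime type can also be extracted in Perl rather than awk, which avoids building a shell command around the path. The following is a minimal sketch, assuming file -i prints a line of the form "<path>: <type>/<subtype>; charset=<charset>". The helper names parse_file_output and mime_type_of are hypothetical and not part of EPrints.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the mime type out of one line of file -i output,
# assumed to look like "<path>: <type>/<subtype>; charset=<charset>"
sub parse_file_output
{
    my( $line ) = @_;

    return undef if !defined $line;

    # The greedy .* runs up to the last ": " which is followed by something
    # shaped like a mime type, so colons and spaces in the path are tolerated
    if( $line =~ m{^.*:\s+([\w.+-]+/[\w.+-]+)} )
    {
        return $1;
    }

    return undef;
}

# Hypothetical helper: run file -i on a path without going through the shell,
# so spaces or quotes in the path need no escaping
sub mime_type_of
{
    my( $path ) = @_;

    open( my $fh, "-|", "file", "-i", $path ) or return undef;
    my $line = <$fh>;
    close( $fh );

    return parse_file_output( $line );
}
```

Inside scanfile, the value returned by mime_type_of( $src_path ) could then be stored in file_cmd_mime in place of the output of the awk pipeline.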

Triggers

Triggers in EPrints are like message queues: you register a function which is called when a trigger point is hit.

There are lots of triggers in EPrints, which are defined in the EPrints/Const.pm file; however, not all of them have trigger points at the time of writing.

Ideally we would want to use EP_TRIGGER_FILES_MODIFIED as our trigger; however, this is one of the ones which currently has no trigger point. So instead we are going to register our code against EP_TRIGGER_AFTER_COMMIT on the file dataset only.

There are two types of trigger:

  • Repository Triggers: General purpose triggers which should always be called with the same params and types.
  • Dataset Triggers: Similar to repository triggers except they operate on a single dataset which changes depending on the trigger point.

We are thus going to register our trigger against the file dataset using a dataset trigger.
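
The idea of registering a function against a dataset and a trigger point can be illustrated with a toy registry in plain Perl. This is a sketch of the general mechanism only, not the EPrints implementation: add_toy_trigger, run_toy_trigger and TOY_TRIGGER_AFTER_COMMIT are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a trigger constant such as EP_TRIGGER_AFTER_COMMIT
use constant TOY_TRIGGER_AFTER_COMMIT => 1;

# Toy registry: dataset id => trigger id => list of callbacks
my %triggers;

# Register a callback against a dataset and a trigger id
sub add_toy_trigger
{
    my( $datasetid, $trigger_id, $callback ) = @_;

    push @{ $triggers{$datasetid}{$trigger_id} }, $callback;
}

# The trigger point: call every callback registered for this dataset and
# trigger, and return how many were called
sub run_toy_trigger
{
    my( $datasetid, $trigger_id, %params ) = @_;

    my $called = 0;
    for my $callback ( @{ $triggers{$datasetid}{$trigger_id} || [] } )
    {
        $callback->( %params );
        $called++;
    }

    return $called;
}

# Remember the id of every file which passes the trigger point
my @seen;
add_toy_trigger( "file", TOY_TRIGGER_AFTER_COMMIT, sub {
    my( %params ) = @_;
    push @seen, $params{fileid};
} );

# Only the trigger point on the file dataset reaches our callback
run_toy_trigger( "file", TOY_TRIGGER_AFTER_COMMIT, fileid => 42 );
run_toy_trigger( "eprint", TOY_TRIGGER_AFTER_COMMIT, eprintid => 7 );
```

In this model a repository trigger would simply be a registry which is not keyed by dataset.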

Our Dataset Trigger

Triggers are registered in the repository config on load, thus the code can sit in our config file under cfg/cfg.d/.

Don't forget to reload the config to load the trigger.

 $c->add_dataset_trigger( "file", EP_TRIGGER_AFTER_COMMIT, sub {
       my( %params ) = @_;
 
       my $repository = $params{repository};
 
       return undef if !defined $repository;
 
       # dataobj is the file object which has just been committed
       if( defined $params{dataobj} ) {
               my $file = $params{dataobj};
               my $file_id = $file->value( "fileid" );
 
               # Queue the scanfile action of the ScanFile event for the indexer
               $repository->dataset( "event_queue" )->create_dataobj({
                       pluginid => "Event::ScanFile",
                       action => "scanfile",
                       params => [$file_id],
               });
       }
 
 });

Finish and test

Once again we can package up these files and install the package. Beware of testing the dataset changes against the database and leaving the fields there after you remove your file for packaging.

To test that this is all working, we can look at the indexer status to ensure the event is getting queued and not failing. Hopefully you won't see anything on this page.

To see the resulting values written to the dataset, you will need to either print them out in the file citation or query the MySQL database for your repository and check that the values are being written to the file table.

 select fileid,filename,file_cmd_mime from file where datasetid="document" order by fileid desc limit 10;