New Features in EPrints 3.2

From EPrints Documentation
Revision as of 11:50, 14 October 2008 by DaveTarrant

New Features

Plug-in Based Storage Layer

The idea is to separate the storage layer from the direct control of the repository. Plug-ins written to a common API store and retrieve the relevant data on request.

Influences

The main influences for this work come from the current Open Storage and Interoperability movements, which became very prominent at the [Open Repositories Conference 2008|http://or08.ecs.soton.ac.uk]. Both movements are well backed by [Sun|http://www.sun.com] and [Microsoft|http://www.microsoft.com], who are looking at hardware and software solutions respectively. Many projects are also promoting this area, not just for their own gain but also for that of the community, including:

  • The Sun Honeycomb Project - Promotes the use of an Open Storage layer for repository software (EPrints, Fedora and DSpace).
  • [The Common Repository Interoperability Group|http://www.ukoln.ac.uk/repositories/digirep/index/CRIG] (CRIG) - Promotes interoperability through common APIs/schemas and Open Storage.
  • [Preserv|http://preserv.eprints.org] - Promotes and develops preservation tools and techniques for digital data in institutional repositories. Both Interoperability and Open Storage lend themselves well to digital preservation.

How it Works

  • One New Core EPrints Object - FileObj

This contains all the file-specific properties and a get_parent method, which retrieves the Document object (from which the same method call returns the EPrint object).

  • A Storage Controller

This is where the appropriate storage plug-in is selected, based on a set of user-defined rules, e.g. files bigger than 1GB use one particular storage plug-in, while user-submitted files (PDFs) use three plug-ins so that they are stored in several places.

  • A Set of Storage Plug-Ins

Any number of plug-ins can be written to handle submission directly to local disk, remote servers, cloud services and even other repositories.
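As a rough illustration of the rule-based selection described above, the following sketch picks plug-in names for a file from an ordered list of rules. It is not the EPrints controller code: the rule structure, the plug-in names (BigFileStore, Local, RemoteMirror, Cloud), the default fallback and the select_plugins function are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical rule table: each rule pairs a test on the file's
# metadata with the names of the plug-ins that should store it.
# Rules are tried in order and the first match wins.
my @RULES = (
    {
        # Files bigger than 1GB go to a single large-file store.
        test    => sub { $_[0]{filesize} > 1024**3 },
        plugins => [ "BigFileStore" ],
    },
    {
        # PDFs are stored in three places.
        test    => sub { $_[0]{mime_type} eq "application/pdf" },
        plugins => [ "Local", "RemoteMirror", "Cloud" ],
    },
);

# Fallback when no rule matches (also an assumption).
my @DEFAULT = ( "Local" );

sub select_plugins
{
    my( $file ) = @_;

    foreach my $rule ( @RULES )
    {
        return @{ $rule->{plugins} } if $rule->{test}->( $file );
    }

    return @DEFAULT;
}

# Usage: a small PDF matches the second rule.
my @chosen = select_plugins( { filesize => 2048, mime_type => "application/pdf" } );
print join( ", ", @chosen ), "\n";
```

Keeping the rules as data rather than hard-coded conditions is one way such a controller could later be configured per archive.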

The File Object - FileObj

Location: perl_lib/EPrints/DataObj/File.pm

This class contains the technical metadata associated with a file. A file is a sequence of bytes stored in the storage layer (a "stored object"). Utility methods for storing and retrieving the stored object from the storage layer are made available.

Revision numbers on File work slightly differently from other objects. A File is only revised when its stored object is changed, not when changes are made to its metadata.

This class is a subclass of EPrints::DataObj::SubObject

CORE FIELDS

 fileid
   Unique identifier for this file.
  
 rev_number (int)
   The number of the current revision of this file.
 
 datasetid
   Id of the dataset of the parent object.
 
 objectid
   Id of the parent object.
 
 bucket
   Name of the bucket the file is in.
 
 filename
   Name of the file (may contain directory separators).
 
 mime_type
   MIME type of the file (e.g. "image/png").
 
 hash
   Check sum of the file.
 
 hash_type
   Name of check sum algorithm used (e.g. "MD5").
 
 filesize
   Size of the file in bytes.
 
 mtime
   Last modification time of the file.

METHODS

Constructor Methods

 $dataobj = EPrints::DataObj::File->new_from_filename( $session, $dataobj, $bucket, $filename )
   Convenience method to get an existing File object for $filename stored in the $bucket in $dataobj.
   Returns undef if no such record exists.
  
 $dataobj = EPrints::DataObj::File->create_from_data( $session, $data [, $dataset ] )
   Create a new File record using $data. If "_filehandle" is defined in $data it will be read from and stored.

Class Methods

 $thing = EPrints::DataObj::File->get_system_field_info
   Core fields.
  
 $dataset = EPrints::DataObj::File->get_dataset_id
   Returns the id of the EPrints::DataSet object to which this record belongs.
  
 $defaults = EPrints::DataObj::File->get_defaults( $session, $data )
   Return default values for this object based on the starting data.

Object Methods

 $success = $stored->remove
   Remove the stored file. Deletes all revisions of the contained object.
 
 $filename = $file->get_local_copy( [ $revision ] )
   Return the name of a local copy of the file (may be a File::Temp object).
   Will retrieve and cache the remote object if necessary.
  
 $fh = $stored->get_fh( [ $revision ] )
   Retrieve a file handle to the stored file (this is a wrapper around EPrints::Storage::retrieve).
  
 $success = $file->add_file( $filepath, $filename [, $preserve_path ] )
   Read and store the contents of $filepath at $filename.
   If $preserve_path is untrue will strip any leading path in $filename.
  
 $success = $file->upload( $filehandle, $filename, $filesize [, $preserve_path ] )
   Read and store the data from $filehandle at $filename at the next revision number.
   If $preserve_path is untrue will strip any leading path in $filename.
 
 $success = $stored->write_copy( $filename [, $revision] )
   Write a copy of this file to $filename.
   Returns true if the written file contains the same number of bytes as the stored file.
 
 $success = $stored->write_copy_fh( $filehandle [, $revision ] )
   Write a copy of this file to $filehandle.
 
 $md5 = $stored->generate_md5
   Calculates and returns the MD5 for this file.
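To make the hash and hash_type fields more concrete, here is a minimal sketch of computing an MD5 checksum from a filehandle with the core Digest::MD5 module. It only illustrates the general technique and is not the body of generate_md5; the function name md5_hex_from_fh is hypothetical, and an in-memory filehandle stands in for one obtained from get_fh.

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute a hex MD5 digest by reading a filehandle in chunks, so
# large files never need to be held in memory at once.
sub md5_hex_from_fh
{
    my( $fh ) = @_;

    binmode( $fh );

    my $md5 = Digest::MD5->new;
    my $buffer;
    while( read( $fh, $buffer, 4096 ) )
    {
        $md5->add( $buffer );
    }

    return $md5->hexdigest;
}

# Usage: an in-memory filehandle stands in for the stored file.
my $content = "hello";
open( my $fh, "<", \$content ) or die "open failed: $!";
my $hash = md5_hex_from_fh( $fh );
close( $fh );

print "$hash\n";    # prints 5d41402abc4b2a76b9719d911017c592
```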

The Storage Controller & API (0.2 Alpha)

  NOTES: This module is where the storage plug-in is selected. Currently there is no per-archive configuration,
  so the controller can only be customised by editing this source file. At the moment it looks through the
  storage plugins directory and uses the first storage plug-in it finds for everything!

LOCATION: /lib/storage/default.xml

This module is the storage control layer which uses EPrints::Plugin::Storage plugins to support various storage back-ends. It enables the storage, retrieval and deletion of data streams. The maximum size of a stream is dependent on the back-end storage mechanism.

Each data stream is located at a repository-unique location constructed from the data object id, bucket name and file name. This may be turned into a URI by the storage layer to achieve global uniqueness (e.g. by using the repository's hostname).

Multiple Storage Mediums

The storage layer may make use of multiple storage back-ends. To assist locating the correct place to store and retrieve streams the API requires the EPrints object and a "bucket" name.

The EPrints object passed to the storage API may be used to choose different storage mediums or to add metadata to the stored stream (e.g. if the storage back end is another repository).

The bucket is a string that identifies classes of streams. For instance EPrints::DataObj::Document objects (currently) have "data" and "thumbnail" buckets for storing files and thumbnails respectively.

Revisions

The storage layer may store multiple revisions located at the same filename.

If the storage medium supports revisioning it is expected that repeated store() calls to the same location will result in multiple revisions. A retrieve() call without any revision will always return the data stored in the last store() call.
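The revision behaviour described above can be illustrated with a toy in-memory store. This is only a sketch under simplifying assumptions: the MemoryStore class is hypothetical, it takes a plain location string and scalar content rather than a $fileobj and filehandle, and it numbers revisions from 1.

```perl
use strict;
use warnings;

# Toy in-memory stand-in for a revision-aware storage medium.
# Each location maps to a list of revisions, oldest first.
package MemoryStore {
    sub new
    {
        my( $class ) = @_;
        return bless( { data => {} }, $class );
    }

    # Every store() to the same location appends a new revision.
    # Returns the new revision number (starting at 1).
    sub store
    {
        my( $self, $location, $content ) = @_;
        push @{ $self->{data}{$location} }, $content;
        return scalar @{ $self->{data}{$location} };
    }

    # Without $revision, return what the last store() call wrote.
    sub retrieve
    {
        my( $self, $location, $revision ) = @_;
        my $revs = $self->{data}{$location} or return;
        $revision = scalar @$revs unless defined $revision;
        return if $revision < 1 || $revision > @$revs;
        return $revs->[ $revision - 1 ];
    }

    # Revision numbers from latest to oldest.
    sub get_revisions
    {
        my( $self, $location ) = @_;
        my $revs = $self->{data}{$location} || [];
        return reverse 1 .. scalar @$revs;
    }
}

my $store = MemoryStore->new;
$store->store( "42/data/paper.pdf", "first draft" );
$store->store( "42/data/paper.pdf", "final version" );

print $store->retrieve( "42/data/paper.pdf" ), "\n";       # final version
print $store->retrieve( "42/data/paper.pdf", 1 ), "\n";    # first draft
print join( ",", $store->get_revisions( "42/data/paper.pdf" ) ), "\n";    # 2,1
```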

METHODS

 $store = EPrints::Storage->new( $session )
   Create a new storage object for $session. Should not be used directly, see EPrints::Session.
  
 $success = $store->store( $fileobj, $filehandle )
   Read from and store all data from $filehandle for $fileobj. Returns false on error.
   
 $filehandle = $store->retrieve( $fileobj [, $revision ] )
   Retrieve a $filehandle to the object stored for $fileobj. If no $revision is specified returns the revision in $fileobj.
 
 $success = $store->delete( $fileobj [, $revision ] )
   Delete the object stored for $fileobj. If no $revision is specified deletes the revision in $fileobj.
  
 $filename = $store->get_local_copy( $fileobj [, $revision ] )
   Return the name of a local copy of the file (may be a File::Temp object). Will retrieve and cache the remote object if necessary.
   
 $size = $store->get_size( $fileobj [, $revision ] )
   Return the $size (in bytes) of the object stored at $fileobj. If no $revision is specified returns the size of the revision in $fileobj.
 
 @revisions = $store->get_revisions( $fileobj )
   Return a list of available revision numbers for $fileobj, in order from latest to oldest.

Writing A Storage Plugin

Implement the methods above for your chosen storage medium. Note that the plug-in is responsible for any mapping from fileobj to a URI or path where the file exists. This leaves URI generation flexible; however, the mapping must remain reproducible from the fileobj.

Currently a local storage plug-in exists which replicates all legacy behaviour of the EPrints storage system from before it had a separate storage layer.
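One possible way to keep the mapping derivable from the fileobj alone is to build the path purely from its core fields. The sketch below assumes a plain hash carrying the datasetid, objectid, bucket, filename and rev_number fields listed earlier. The directory layout, the fileobj_to_path name and the escaping rules are hypothetical choices for illustration, not what any existing plug-in does.

```perl
use strict;
use warnings;

# Build a storage path from the fields of a file record. The same
# input always yields the same path, so nothing extra has to be
# remembered to find the stream again.
sub fileobj_to_path
{
    my( $root, $file, $revision ) = @_;

    $revision = $file->{rev_number} unless defined $revision;

    # Percent-encode anything outside a conservative safe set so the
    # ids and bucket cannot introduce unexpected path components.
    my $escape = sub {
        my( $part ) = @_;
        $part =~ s/([^A-Za-z0-9_.-])/sprintf( "%%%02X", ord( $1 ) )/ge;
        return $part;
    };

    # filename may contain directory separators, so escape each
    # segment separately and refuse segments that could climb out
    # of the storage root.
    my @segments = split m{/}, $file->{filename};
    die "unsafe filename: $file->{filename}\n"
        if !@segments || grep { $_ eq "" || $_ eq "." || $_ eq ".." } @segments;

    return join "/",
        $root,
        $escape->( $file->{datasetid} ),
        $escape->( $file->{objectid} ),
        $escape->( $file->{bucket} ),
        "rev$revision",
        map { $escape->( $_ ) } @segments;
}

my $file = {
    datasetid  => "document",
    objectid   => 12,
    bucket     => "data",
    filename   => "images/figure 1.png",
    rev_number => 3,
};

print fileobj_to_path( "/var/storage", $file ), "\n";
# prints /var/storage/document/12/data/rev3/images/figure%201.png
```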

As a quick guide to writing a storage plug-in the following provides a template:

  package EPrints::Plugin::Storage::NAME;
    
  use URI;
  use URI::Escape;
 
  use EPrints::Plugin::Storage;
 
  our @ISA = ( "EPrints::Plugin::Storage" );
 
  use strict;
   
  sub new
  {
       my( $class, %params ) = @_;
  
       my $self = $class->SUPER::new( %params );
 
       $self->{name} = "NAME Storage Plugin";
  
       return $self;
  }
 
  sub store
  {
       my( $self, $fileobj, $fh ) = @_;
       
       ...
  
       return $success;
  }
  
  sub retrieve
  {
       my( $self, $fileobj, $revision ) = @_;
   
       ...
 
       return $in_file_handle;                        
  }
  
  sub delete
  {
       my( $self, $fileobj, $revision ) = @_;
       
       ...
  
       return $success;
  }        
 
  sub get_size
  {
       my( $self, $fileobj, $revision ) = @_;
       
       ...
  
       return $size;
  }        
  
  # Optional methods
  
  # get_local_copy() is provided by EPrints::Plugin::Storage so does
  # not have to be implemented. If the file is stored
  # locally it may be more efficient to return its filename
  # here rather than copying to a temporary file.
  
  sub get_local_copy
  {
       my( $self, $fileobj, $revision ) = @_;
       
       ...
       
       return $filename;
  }
  
  # (get_revisions is yet to be finalised)
  
  sub get_revisions
  {
       my( $self, $fileobj ) = @_;
       
       ...
  
       return @revisions;
  }
  
  # A Perl module must end by returning a true value.
  1;
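To show how the required methods in the template might be filled in, here is a small disk-backed sketch. It is not the existing local storage plug-in. So that it runs on its own, it does not inherit from EPrints::Plugin::Storage, it takes a plain hash in place of a real fileobj, and the LocalDiskSketch name and the root/objectid/bucket/revision/filename layout are hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw( dirname );
use File::Path qw( make_path );
use File::Temp qw( tempdir );

package LocalDiskSketch {
    sub new
    {
        my( $class, %params ) = @_;
        return bless( { root => $params{root} }, $class );
    }

    # Hypothetical layout: root/objectid/bucket/revision/filename
    sub _path
    {
        my( $self, $fileobj, $revision ) = @_;
        $revision = $fileobj->{rev_number} unless defined $revision;
        return join "/", $self->{root}, $fileobj->{objectid},
            $fileobj->{bucket}, $revision, $fileobj->{filename};
    }

    # Copy everything from $fh to disk. Returns false on error.
    sub store
    {
        my( $self, $fileobj, $fh ) = @_;
        my $path = $self->_path( $fileobj );
        File::Path::make_path( File::Basename::dirname( $path ) );
        open( my $out, ">", $path ) or return 0;
        binmode( $out );
        my $buffer;
        while( read( $fh, $buffer, 4096 ) )
        {
            print {$out} $buffer or return 0;
        }
        return close( $out ) ? 1 : 0;
    }

    # Returns a read filehandle, or undef if the file is missing.
    sub retrieve
    {
        my( $self, $fileobj, $revision ) = @_;
        my $path = $self->_path( $fileobj, $revision );
        open( my $in, "<", $path ) or return;
        binmode( $in );
        return $in;
    }

    sub delete
    {
        my( $self, $fileobj, $revision ) = @_;
        return unlink( $self->_path( $fileobj, $revision ) ) ? 1 : 0;
    }

    # Size in bytes, or undef if the file is missing.
    sub get_size
    {
        my( $self, $fileobj, $revision ) = @_;
        my @st = stat( $self->_path( $fileobj, $revision ) );
        return @st ? $st[7] : undef;
    }
}

# Usage: store an in-memory stream in a temporary directory.
my $plugin  = LocalDiskSketch->new( root => tempdir( CLEANUP => 1 ) );
my $fileobj = { objectid => 7, bucket => "data", rev_number => 1, filename => "notes.txt" };
my $content = "hello storage";
open( my $src, "<", \$content ) or die "open failed: $!";
$plugin->store( $fileobj, $src ) or die "store failed";
print $plugin->get_size( $fileobj ), "\n";    # prints 13
```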