StorageController


The EPrints Storage Controller is the first of its kind in any repository software. It enables the repository manager to use multiple storage platforms, including local, institutional and cloud-based storage. Managing these platforms through EPrints makes your resources easier to manage and to preserve for future use.

Current Testing and Working Plugins

  • Local Disk Storage
    • Legacy plug-in supporting all current repository deployments
  • Amazon S3 & CloudFront
    • Enables storage of files in the cloud. The storage controller allows files to be served directly from Amazon to your users, conserving repository bandwidth.
  • Sun Honeycomb
    • Although this platform is now retired, the plug-in works with the last release of the Honeycomb software.

Influences

The main influences for this work come from the current Open Storage and Interoperability movements, which became very prominent at the [Open Repositories Conference 2008|http://or08.ecs.soton.ac.uk]. Both movements are well backed by [Sun|http://www.sun.com] and [Microsoft|http://www.microsoft.com], who are looking at hardware and software solutions respectively. Many projects are also promoting this area, not just for their own gain but also for that of the community, including:

  • The Sun Honeycomb Project - Promotes the use of an Open Storage layer for repository software (EPrints, Fedora and DSpace).
  • [The Common Repository Interoperability Group|http://www.ukoln.ac.uk/repositories/digirep/index/CRIG] (CRIG) - Promotes interoperability through common APIs/schemas and Open Storage.
  • [Preserv | http://preserv.eprints.org] - Promotes and develops preservation tools and techniques for digital data in institutional repositories. Both Interoperability and Open Storage lend themselves well to digital preservation.

How it Works

  • One new core EPrints object - FileObj

This contains all the file-specific properties and a get_parent method, which retrieves the Document object (from which the same method call returns the EPrint object).

  • A Storage Controller

This is where the appropriate storage plug-in is selected, based on a set of user-defined rules. For example, files bigger than 1 GB could use one particular storage plug-in, while user-submitted files (PDFs) could use three plug-ins so that they are stored in several places.

  • A Set of Storage Plug-Ins

Any number of plug-ins can be written to handle submission directly to local disk, remote servers, cloud services and even other repositories.
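
The selection step described above can be sketched as an ordered list of rules, each pairing a test with the plug-ins that should receive the file. This is only an illustration of the idea, not how the controller is implemented: the rule table, the select_plugins() function, the filesize and mime_type fields and the plug-in names are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical rule table: each rule pairs a test on a file description
# with the names of the storage plug-ins that should receive the file.
# Field names (filesize, mime_type) and plug-in names are illustrative only.
my @rules = (
    {
        test    => sub { $_[0]{filesize} > 1024**3 },    # larger than 1 GB
        plugins => [ 'BigFileStore' ],
    },
    {
        test    => sub { $_[0]{mime_type} eq 'application/pdf' },
        plugins => [ 'Local', 'CloudA', 'CloudB' ],
    },
);

# Return the plug-in names from the first matching rule,
# falling back to a single default when no rule matches.
sub select_plugins
{
    my( $file ) = @_;

    for my $rule ( @rules )
    {
        return @{ $rule->{plugins} } if $rule->{test}->( $file );
    }

    return ( 'Local' );
}

my @chosen = select_plugins( { filesize => 2_000_000, mime_type => 'application/pdf' } );
print join( ", ", @chosen ), "\n";    # prints "Local, CloudA, CloudB"
```

Evaluating the rules in order and stopping at the first match keeps the behaviour predictable when several rules could apply to the same file.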


The Storage Controller & API (0.2 Alpha)

  NOTES: This module is where the storage plug-in is selected. Currently there is no configuration capability
  on a per-archive basis, so the controller can only be customised by editing this source file. At the moment it looks through the
  storage plugins directory and uses the first storage plug-in it finds for everything!

LOCATION: /lib/storage/default.xml

This module is the storage control layer which uses EPrints::Plugin::Storage plugins to support various storage back-ends. It enables the storage, retrieval and deletion of data streams. The maximum size of a stream is dependent on the back-end storage mechanism.

Each data stream is located at a repository-unique location constructed from the data object id, bucket name and file name. This may be turned into a URI by the storage layer to achieve global uniqueness (e.g. by using the repository's hostname).

Multiple Storage Mediums

The storage layer may make use of multiple storage back-ends. To assist in locating the correct place to store and retrieve streams, the API requires the EPrints object and a "bucket" name.

The EPrints object passed to the storage API may be used to choose different storage mediums or to add metadata to the stored stream (e.g. if the storage back end is another repository).

The bucket is a string that identifies classes of streams. For instance EPrints::DataObj::Document objects (currently) have "data" and "thumbnail" buckets for storing files and thumbnails respectively.

Revisions

The storage layer may store multiple revisions located at the same filename.

If the storage medium supports revisioning it is expected that repeated store() calls to the same location will result in multiple revisions. A retrieve() call without any revision will always return the data stored in the last store() call.
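
The revision behaviour described above can be modelled with a toy in-memory store. This is a sketch for illustration only and does not use the EPrints API: the ToyRevisionStore package, its location strings and its 1-based revision numbers are all assumptions.

```perl
use strict;
use warnings;

# Toy in-memory store, NOT the EPrints API: it only models the revision
# behaviour described above. All names here are hypothetical.
package ToyRevisionStore;

sub new { return bless { revisions => {} }, shift }

# Each store() to the same location appends a new revision.
# Returns the 1-based revision number just written.
sub store
{
    my( $self, $location, $data ) = @_;

    push @{ $self->{revisions}{$location} }, $data;

    return scalar @{ $self->{revisions}{$location} };
}

# Without $revision, return the data from the most recent store() call.
# Returns undef for an unknown location or revision.
sub retrieve
{
    my( $self, $location, $revision ) = @_;

    my $list = $self->{revisions}{$location} or return;
    $revision = scalar @$list unless defined $revision;
    return if $revision < 1 || $revision > @$list;

    return $list->[ $revision - 1 ];
}

package main;

my $store = ToyRevisionStore->new;
$store->store( "42/data/paper.pdf", "first draft" );
$store->store( "42/data/paper.pdf", "second draft" );

print $store->retrieve( "42/data/paper.pdf" ), "\n";       # prints "second draft"
print $store->retrieve( "42/data/paper.pdf", 1 ), "\n";    # prints "first draft"
```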

METHODS

 $store = EPrints::Storage->new( $session )
   Create a new storage object for $session. Should not be used directly, see EPrints::Session.
  
 $success = $store->store( $fileobj, $filehandle )
   Read from and store all data from $filehandle for $fileobj. Returns false on error.
   
 $filehandle = $store->retrieve( $fileobj [, $revision ] )
   Retrieve a $filehandle to the object stored for $fileobj. If no $revision is specified returns the revision in $fileobj.
 
 $success = $store->delete( $fileobj [, $revision ] )
   Delete the object stored for $fileobj. If no $revision is specified deletes the revision in $fileobj.
  
 $filename = $store->get_local_copy( $fileobj [, $revision ] )
   Return the name of a local copy of the file (may be a File::Temp object). Will retrieve and cache the remote object if necessary.
 
 $url = $store->get_remote_copy( $fileobj, $sourceid )
   Returns an alternative URL for this file (must be publicly accessible). Returns undef if no such copy is available.
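
get_local_copy() is described above as retrieving and caching the remote object if necessary. A minimal sketch of such a fallback, assuming the data is available as a readable filehandle (as returned by retrieve()), might look like the following. The copy_to_temp() helper is a hypothetical name and the actual EPrints implementation may differ.

```perl
use strict;
use warnings;
use File::Temp;

# Sketch of a generic fallback: copy everything readable from $in_fh
# into a temporary file and return the File::Temp object.
# copy_to_temp() is a hypothetical helper name, not an EPrints function.
sub copy_to_temp
{
    my( $in_fh ) = @_;

    my $tmp = File::Temp->new;    # the file is removed when $tmp is destroyed
    binmode( $in_fh );
    binmode( $tmp );

    my $buffer;
    while( read( $in_fh, $buffer, 65536 ) )
    {
        print {$tmp} $buffer or return;    # give up if a write fails
    }
    $tmp->flush;    # make sure the data is on disk before the name is used

    return $tmp;
}

# Usage with an in-memory filehandle standing in for a remote stream.
my $payload = "hello storage";
open( my $in, '<', \$payload ) or die "cannot open in-memory handle: $!";
my $copy = copy_to_temp( $in );
print "local copy at ", $copy->filename, "\n";
```

Returning the File::Temp object rather than a bare filename means the temporary file lives exactly as long as the caller holds on to it.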
   

Writing A Storage Plugin

Implement the methods above for your chosen storage medium. Note that the plug-in is responsible for any mapping from fileobj to a URI or path where the file exists. This leaves URI generation flexible; however, the mapping must remain reproducible from the fileobj.
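
As an illustration of a reproducible mapping, the sketch below builds a relative location from a data object id, a bucket name and a file name. It is a hypothetical layout, not the one used by any existing EPrints plug-in, and both function names are invented for this example. It assumes byte strings and is not hardened against names such as "..".

```perl
use strict;
use warnings;

# Hypothetical mapping, not the layout used by any real EPrints plug-in.
# Percent-encode anything outside a conservative safe set so that a
# component cannot introduce extra path separators. Assumes byte strings.
sub escape_component
{
    my( $text ) = @_;

    $text =~ s/([^A-Za-z0-9._-])/sprintf( "%%%02X", ord( $1 ) )/ge;

    return $text;
}

# Build the same relative location every time from the same three inputs.
sub location_for
{
    my( $objectid, $bucket, $filename ) = @_;

    return join "/", map { escape_component( $_ ) } $objectid, $bucket, $filename;
}

print location_for( 42, "data", "my paper.pdf" ), "\n";    # prints "42/data/my%20paper.pdf"
```

Because the result depends only on its inputs, store(), retrieve() and delete() can each recompute the location instead of keeping a lookup table.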

Currently a local storage plug-in exists which replicates all the legacy behaviour of the EPrints storage system from before it had a separate storage layer.

As a quick guide to writing a storage plug-in the following provides a template:

  package EPrints::Plugin::Storage::NAME;
    
  use URI;
  use URI::Escape;
 
  use EPrints::Plugin::Storage;
 
  @ISA = ( "EPrints::Plugin::Storage" );
 
  use strict;
   
  sub new
  {
       my( $class, %params ) = @_;
  
       my $self = $class->SUPER::new( %params );
 
       $self->{name} = "NAME Storage Plugin";
  
       return $self;
  }
 
  sub store
  {
       my( $self, $fileobj, $fh ) = @_;
       
       ...
  
       return $success;
  }
  
  sub retrieve
  {
       my( $self, $fileobj, $revision ) = @_;
   
       ...
 
       return $in_file_handle;                        
  }
  
  sub delete
  {
       my( $self, $fileobj, $revision ) = @_;
       
       ...
  
       return $success;
  }              
  
  # Optional methods
  
  # get_local_copy() is provided by EPrints::Plugin::Storage so does
  # not have to be implemented. If the file is stored
  # locally it may be more efficient to return its filename
  # here rather than copying to a temporary file.
  sub get_local_copy
  {
       my( $self, $fileobj, $revision ) = @_;
       
       ...
       
       return $filename;
  }
  # get_remote_copy() allows the platform storing the file to directly
  # provide the file to the end user.
  sub get_remote_copy
  {
       my( $self, $fileobj, $uri ) = @_;
   
       ...

       return $uri;
  }

  # A Perl module file must end by returning a true value.
  1;