SWORD 2.0

From EPrints Documentation

SWORD 2.0 is the default implementation in EPrints 3.3. A plugin is available to enable SWORD 1.3 instead, but see the note in the servicedocument section below.

Terminology

SWORD 2.0 uses a few terms with specific meanings:

  • collection: The specific URL within the server that the data goes into. For EPrints this generally means inbox, review, archive or deleted; DSpace, by contrast, has its own Collection concept, and Fedora has a similar RDF tag for defining collective groupings.
  • package: The URI that identifies how a particular deposit has been wrapped up.
  • content-type: A MIME type (usually vendor-specific) that the server recognises as the trigger for a particular way of handling the deposit (very important in CRUD systems).
    • (eg application/vnd.rjbroker)
  • servicedocument: The document that the SWORD server returns to tell clients which collections and which packages the service understands.

Configuring SWORD

No configuration is required - SWORD 2.0 is enabled by default. There are some caveats you need to be aware of:

  • There is only one collection: /id/contents
    • All updates are done by naming the specific eprint ID
  • By default, new records are created in the "review" buffer
    • If the incoming request has the In-Progress header set to true, the deposit is put into the inbox instead
    • If the repository has been configured with skip_buffer, items are put into the archive buffer rather than review
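
The buffer rules above can be summarised as a small decision function. This is only a sketch: target_buffer is a hypothetical helper, not part of EPrints, and giving In-Progress precedence over skip_buffer when both are set is an assumption rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper (not an EPrints function): returns the name of the
# buffer a new deposit lands in, following the caveats listed above.
sub target_buffer
{
  my ( %opts ) = @_;

  # In-Progress: true keeps the item in the depositor's inbox.
  # ASSUMPTION: this wins if skip_buffer is also set.
  return "inbox" if $opts{in_progress};

  # skip_buffer sends items to the archive instead of review
  return "archive" if $opts{skip_buffer};

  # Default behaviour
  return "review";
}

print target_buffer(), "\n";                      # review
print target_buffer( in_progress => 1 ), "\n";    # inbox
print target_buffer( skip_buffer => 1 ), "\n";    # archive
```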

servicedocument

The servicedocument is an XML listing of what content-types (and packages) the server understands.

It no longer lists Q-Values.

The default location for the servicedocument is /sword-app/servicedocument

NOTE: If you install the SWORD 1.3 plugin into EPrints 3.3, /sword-app/servicedocument returns the SWORD 1.3 configuration, not the SWORD 2.0 information.

example

Below is an example skeleton of the servicedocument:

<?xml version='1.0' encoding='UTF-8'?>
<service xmlns="http://www.w3.org/2007/app" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <workspace>
    <atom:title>3.3.5: Manage deposits</atom:title>
    <sword:version>2.0</sword:version>
    <collection href="http://devel.edina.ac.uk:1202/id/contents">
      <atom:title>Eprints</atom:title>
      <sword:mediation>true</sword:mediation>
    </collection>
  </workspace>
</service>

The list of accepted content, which appears inside the collection element, is given thus:

  <accept alternate="multipart-related">application/vnd.eprints.data+xml; charset=utf-8</accept>
  <acceptPackaging>http://eprints.org/ep2/data/2.0</acceptPackaging>
  <accept alternate="multipart-related">application/zip</accept>
  <accept alternate="multipart-related">application/x-gzip</accept>
  <acceptPackaging>http://purl.org/net/sword/package/SimpleZip</acceptPackaging>
  <accept alternate="multipart-related">application/vnd.openxmlformats-officedocument.wordprocessingml.document</accept>
  <accept alternate="multipart-related">application/vnd.openxmlformats</accept>
  <accept alternate="multipart-related">application/msword</accept>
  <accept alternate="multipart-related">application/vnd.rjbroker</accept>
  <acceptPackaging>http://opendepot.org/broker/1.0</acceptPackaging>
  <acceptPackaging>http://purl.org/net/sword/package/Binary</acceptPackaging>
  <accept alternate="multipart-related">application/octet-stream</accept>

Notice that the list mixes packaging formats (the acceptPackaging elements) with MIME types (the accept elements, all offered as multipart-related alternates).
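
A client that wants to know what it may send can separate the two kinds of entry. The sketch below scans a trimmed copy of the listing above with regular expressions. That is enough for a quick check, but it is an illustration only, and a real client would be better served by a proper XML parser.

```perl
use strict;
use warnings;

# Trimmed fragment of the accept list shown above.
my $xml = <<'XML';
<accept alternate="multipart-related">application/zip</accept>
<acceptPackaging>http://purl.org/net/sword/package/SimpleZip</acceptPackaging>
<accept alternate="multipart-related">application/vnd.rjbroker</accept>
<acceptPackaging>http://opendepot.org/broker/1.0</acceptPackaging>
XML

# The \b stops "<accept" from also matching "<acceptPackaging".
my @mime_types = $xml =~ m{<accept\b[^>]*>\s*([^<]+?)\s*</accept>}g;
my @packagings = $xml =~ m{<acceptPackaging\b[^>]*>\s*([^<]+?)\s*</acceptPackaging>}g;

print "MIME type: $_\n" for @mime_types;
print "Packaging: $_\n" for @packagings;
```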

Writing your own Importer

In EPrints 3.3, all importers live in the same place; there is no longer a distinction between "SWORD" importers and any others.

As with all non-core code, importers live in ~~eprints/lib/plugins/EPrints/Plugin/Import. Because this directory is global to all repositories, plugins under lib are disabled by default. (In fact, all Perl packages were visible to all repositories under 3.2 too: the joys of mod_perl.)

You can initially develop code in ~~eprints/perl-lib/EPrints/Plugin/Import/, but this is non-portable and liable to get lost if (when!) you upgrade your installation of EPrints.

If you start out by creating an EPrints package, you need to read My_First_Bazaar_Package. An importer package has two parts:

  1. The actual Perl package that handles the file being deposited
    • This lives in ~~eprints/lib/plugins/EPrints/Plugin/Import/
    • For complex importers, you may end up writing multiple packages - it's Perl... TMTOWTDI
  2. The configuration that enables the new package for the specific repository
    • This lives in ~~eprints/lib/epm/<Bazaar_package_name>/cfg/cfg.d/
Configuration file

This is simply an enabling statement - for example:

# Ensure the plugin is enabled
$c->{plugins}->{"Import::RJ_Broker_2"}->{params}->{disable} = 0;

Note that, whilst the actual Perl package is EPrints/Plugin/Import/Foo.pm, and will be called EPrints::Plugin::Import::Foo, the configuration file already knows it's a plugin that's being enabled, so it only needs Import::Foo
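
The naming rule can be shown mechanically. In this sketch, plugin_id is a hypothetical helper written for illustration (it is not an EPrints function); it strips the common prefix from a package name and prints the matching enabling statement.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the short plugin id used in cfg.d files
# from the full Perl package name.
sub plugin_id
{
  my ( $package ) = @_;
  ( my $id = $package ) =~ s/^EPrints::Plugin:://;
  return $id;
}

my $id = plugin_id("EPrints::Plugin::Import::Foo");

# Prints the line you would put in the configuration file
print '$c->{plugins}->{"' . $id . '"}->{params}->{disable} = 0;' . "\n";
```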

Importer Plugin

The importer plugin is a Perl package that handles the deposited file and creates a new record in the repository for it.

The Importer system is, as with all EPrints code, deeply hierarchical:

  1. EPrints::Plugin::Import::MyImporter will inherit most of its code from EPrints::Plugin::Import
  2. EPrints::Plugin::Import is a base class (one that is there to provide central functions), and inherits from EPrints::Plugin
  3. EPrints::Plugin is the base class for all plugins.

In practice, the best way to learn how importers are written is to look at existing importers (~~eprints/perl-lib/EPrints/Plugin/Import/...)
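
The three-level hierarchy can be mimicked with stand-in classes to see how a call to new travels up the chain. The Sketch::* packages below are hypothetical and only mirror the shape of the real classes; they do not reproduce any EPrints behaviour.

```perl
use strict;
use warnings;

# Stand-in for EPrints::Plugin: the base class for everything
package Sketch::Plugin;
sub new { my ( $class, %params ) = @_; return bless {%params}, $class; }
sub describe { my ( $self ) = @_; return ref($self) . " named '$self->{name}'"; }

# Stand-in for EPrints::Plugin::Import: shared importer defaults
package Sketch::Plugin::Import;
our @ISA = ('Sketch::Plugin');
sub new
{
  my ( $class, %params ) = @_;
  my $self = $class->SUPER::new(%params);
  $self->{produce} ||= [];
  return $self;
}

# Stand-in for a concrete importer: only sets what differs
package Sketch::Plugin::Import::MyImporter;
our @ISA = ('Sketch::Plugin::Import');
sub new
{
  my ( $class, %params ) = @_;
  my $self = $class->SUPER::new(%params);
  $self->{name} = "My importer";
  return $self;
}

package main;

my $plugin = Sketch::Plugin::Import::MyImporter->new;
print $plugin->describe, "\n";    # describe() is inherited from the base class
```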

Basic framework

The very basic framework for an importer is just to register the importer with EPrints, and leave all functions to be handled by inheritance:

package EPrints::Plugin::Import::RJ_Broker_2;

use strict;

use EPrints::Plugin::Import::Binary;
use EPrints::Plugin::Import::Archive;

our @ISA = qw/ EPrints::Plugin::Import::Archive /;

sub new 
{
  my ( $class, %params ) = @_;

  my $self = $class->SUPER::new(%params);

  # The name to display
  $self->{name} = "RJ_Broker package";

  # Who can see it, and whether we show it
  $self->{visible}   = "all";
  $self->{advertise} = 1;

  # What the importer produces
  $self->{produce} = [qw( list/eprint dataobj/eprint )];

  # What mime-types can trigger this importer
  $self->{accept} = [ "application/vnd.rjbroker",
    "sword:http://opendepot.org/broker/1.0" ];

  return $self;
} ## end sub new

1;

This defines a few important things:

  $self->{name} = "RJ_Broker package";

is the name to display in public

  # What the importer produces
  $self->{produce} = [qw( list/eprint dataobj/eprint )];

defines what the importer actually returns (some handle specific files [.pdf, .doc, .mp3], others may import several records in one go [eg EndNote files])

  # What mime-types can trigger this importer
  $self->{accept} = [ "application/vnd.rjbroker",
    "sword:http://opendepot.org/broker/1.0" ];

defines which content-types and/or package-types trigger the use of this importer. (Two importers registering to handle the same content-type is A Bad Thing[tm].)
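
To see why duplicate registrations are a problem, consider a toy lookup over the accept lists. This is a sketch of the idea only, not how EPrints resolves plugins internally: the %accepts registry, the Import::Example_Zip plugin and the importers_for helper are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical registry: plugin id => the values in its {accept} list.
my %accepts = (
  "Import::RJ_Broker_2" => [
    "application/vnd.rjbroker",
    "sword:http://opendepot.org/broker/1.0",
  ],
  "Import::Example_Zip" => [ "application/zip" ],    # made-up plugin
);

# Return the ids of every plugin that claims the given content-type or
# packaging URI (the latter prefixed with "sword:", as in the example above).
sub importers_for
{
  my ( $wanted ) = @_;
  my @ids = grep {
    my $id = $_;
    grep { $_ eq $wanted } @{ $accepts{$id} };
  } keys %accepts;
  return sort @ids;
}

my @found = importers_for("application/vnd.rjbroker");
warn "More than one importer claims this type: @found\n" if @found > 1;
print "Handled by: @found\n";
```

With a second plugin claiming application/vnd.rjbroker, the lookup would return two ids and there would be no obvious way to choose between them.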

Whilst this is pretty, it's actually pretty useless as all it will do is replicate what Import::Archive does. To be useful, it needs to somehow read in some metadata and maybe attach some files.

The first important task will be to handle the file deposited. For this, you need to create a function called input_fh, and it will have a framework something like:

sub input_fh 
{
  my ( $plugin, %opts ) = @_;

  my $fh      = $opts{fh};
  my $dataset = $opts{dataset};
  my $repo    = $plugin->{session};

  # Copy the deposited file (which should be a zip) to a temporary
  # file and get a handle on it ($zipfile).
  # (local method)
  my $zipfile = $plugin->upload_archive($fh);

  ## Do magic to get the XML file with the metadata
  my $xml_data;    # must be filled in by the step above

  my $epdata = $plugin->parse_epdcx_xml_data($xml_data);

  # IDs of the records created by this deposit
  my @ids;

  my $dataobj = $plugin->epdata_to_dataobj( $dataset, $epdata );
  if ( defined $dataobj ) {
    push @ids, $dataobj->get_id;
  }

  return EPrints::List->new(
    dataset => $dataset,
    session => $repo,
    ids     => \@ids
  );

}

sub upload_archive 
{
  my ( $self, $fh ) = @_;

  use bytes;
  use File::Temp;

  binmode($fh);

  # Temporary file that receives a byte-for-byte copy of the deposit
  my $zipfile = File::Temp->new();
  binmode($zipfile);

  # sysread returns 0 at end of file and undef on error
  my $rc;
  while ( $rc = sysread( $fh, my $buffer, 4096 ) ) {
    syswrite( $zipfile, $buffer );
  }
  EPrints->abort("Error reading from file handle: $!") if !defined $rc;

  return $zipfile;
} ## end sub upload_archive

sub parse_epdcx_xml_data
{
  my ( $plugin, $xml ) = @_;

  my $epdata         = {};
 
  ## parse the XML as needed...

  return $epdata;
} ## end sub parse_epdcx_xml_data

Depositing (from Perl)

A SWORD deposit is, at its most basic level, just an HTTP POST request, so can be scripted fairly easily.

This is an example of an initial deposit, where the content being posted is in a bespoke format:

   use LWP::UserAgent;
   use MIME::Base64;

   # $ep is the eprint to transfer
   my $ua = LWP::UserAgent->new;
   my $auth = "Basic " . MIME::Base64::encode( "$username:$password", '' );

   my %headers = (
      'X-No-Op'             => 'false',
      'X-Verbose'           => 'true',
      'Content-Disposition' => "attachment; filename=$filename",  # The name of the "file" the importer thinks it is reading
      'Content-Type'        => $mime,                 # This triggers what parses the content.  eg: application/vnd.rjbroker
      'User-Agent'          => 'OA-RJ Broker v0.2',
      'Authorization'       => $auth,
   );

   if ($in_progress) {
     $headers{'In-Progress'} = 'true';                # Pushes into "inbox" rather than "review"
   }

   my $url    = "${host}${collection}";               # eg: http://eprints.example.com/id/contents
   my $buffer = $ep->export($exporter);               # eg: Bespoke_Export_Routine (as in EPrints::Plugin::Export::Bespoke_Export_Routine)

   my $return_id;                                     # ID of the new record, if one could be found
   my $r = $ua->post( $url, %headers, Content => $buffer );
   if ( $r->is_success ) {
     # Transferred: pull the ID out of the returned Atom entry
     my $content = $r->content;
     if ( $content =~ m#<id>([^<]+)</id># ) {
        $return_id = $1;
     }
   } else {
     # fail: $r->status_line describes what went wrong
   }
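
When a deposit misbehaves, it can help to build and inspect the headers without sending anything. The sketch below does that using only core modules. deposit_headers is a hypothetical helper, the credentials and filename are made up, and the attachment; filename=... form of Content-Disposition is an assumption about what the server expects.

```perl
use strict;
use warnings;
use MIME::Base64 qw( encode_base64 );

# Hypothetical helper: build the headers for a deposit as a hash,
# so they can be printed or checked before any request is made.
sub deposit_headers
{
  my ( %opts ) = @_;

  my %headers = (
    'Content-Type'        => $opts{mime},
    'Content-Disposition' => "attachment; filename=$opts{filename}",
    # The second argument '' stops encode_base64 appending a newline
    'Authorization'       => "Basic "
      . encode_base64( "$opts{username}:$opts{password}", '' ),
  );

  # Only sent when the item should stay in the inbox
  $headers{'In-Progress'} = 'true' if $opts{in_progress};

  return %headers;
}

my %h = deposit_headers(
  username    => "depositor",
  password    => "secret",
  mime        => "application/vnd.rjbroker",
  filename    => "deposit.zip",
  in_progress => 1,
);

print "$_: $h{$_}\n" for sort keys %h;
```

The resulting hash can be passed to $ua->post in the same position as %headers in the example above.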