Import and Export Plug-ins

From EPrints Documentation
Revision as of 14:11, 30 November 2010

Import and Export Plug-ins are all about taking an object, de-serialising it into an eprint (or other object in the EPrints system) and re-serialising it back out into a supported format.

We would typically advocate that every import plug-in should have a corresponding export plug-in and vice versa.

Input Data

We are going to attempt to import the following XML record into our system. If you recognise the format of this record, all the better; otherwise we'll leave it as a surprise for now.

 <?xml version="1.0" encoding="UTF-8"?>
 
 <package xmlns="http://www.idpf.org/2007/opf" unique-identifier="EPB-UUID" version="2.0">
  <metadata xmlns:opf="http://www.idpf.org/2007/opf"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:title>The Adventures of Sherlock Holmes</dc:title>
     <dc:creator opf:role="aut" opf:file-as="Doyle, Sir Arthur Conan">Sir Arthur Conan Doyle</dc:creator>
     <dc:publisher>epubBooks (www.epubbooks.com)</dc:publisher>
     <dc:date opf:event="epub-publication">2010-06-15</dc:date>
     <dc:subject>Crime/Detective</dc:subject>
     <dc:subject>Short Stories</dc:subject>
     <dc:source>Project Gutenberg</dc:source>
     <dc:rights>
           Provided for free by epubBooks.com. Not for commercial use.
           This EPUB eBook is released under a Creative Commons (BY-NC-ND/3.0) Licence.
           Source text and images are in the Public Domain.
     </dc:rights>
     <dc:identifier id="EPB-UUID">urn:uuid:FF946905-6C08-1014-90EA-E81F8523F0DC</dc:identifier>
     <dc:language>en-gb</dc:language>
  </metadata>
  <manifest>
      <!-- Data not included -->
  </manifest>
  <spine>
      <!-- Data not included -->
  </spine>
 </package>

Importing

To import this record we need a fully featured import plug-in that is capable of handling both file and raw XML input. We are also going to use an XSLT plug-in to make the process even easier.
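
As a rough illustration of what "both kinds of input" means in practice, here is a minimal sketch of how a plug-in might tell them apart, using only core Perl. The helper names classify_input and extract_urls are hypothetical and are not part of EPrints; the two regular expressions are the same tests the plug-in listed further down uses.

```perl
use strict;
use warnings;

# Hypothetical helper: decide how a blob of uploaded text should be handled.
# Returns 'xml' when the text starts with an XML declaration, 'list' otherwise.
sub classify_input
{
    my( $text ) = @_;

    return $text =~ /^<\?xml/ ? 'xml' : 'list';
}

# Hypothetical helper: pull the usable URLs out of a list, one per line.
sub extract_urls
{
    my( $text ) = @_;

    my @urls;
    foreach my $line ( split /\n/, $text )
    {
        $line =~ s/\s+//g;              # strip all whitespace
        next unless $line =~ /^http/;   # skip blank lines and anything that is not a URL
        push @urls, $line;
    }

    return @urls;
}

print classify_input( qq{<?xml version="1.0"?>\n<package/>} ), "\n";
print classify_input( "http://example.org/list.txt" ), "\n";
```

In the real plug-in the 'list' case is treated as a URL, which is fetched and expected to contain one record URL per line; the sketch only covers the classification and filtering steps.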

Stage 1: Create the files

We need the following:

  • cfg/plugins/EPrints/Plugin/Import/OPFXML.pm
  • cfg/plugins/EPrints/Plugin/Import/XSLT/DC.xsl

OPFXML.pm

This will be our master import plug-in to handle the entire chunk of XML. We will then use the DC plug-in to map the metadata into an eprint we can use later.

Since this is the master plug-in, it requires quite a sizeable amount of code:

 package EPrints::Plugin::Import::OPFXML;
 
 use strict;
 
 # Used for the temporary files created below
 use File::Temp;
 
 # Declare Namespaces
 our $DC_NS = "http://purl.org/dc/elements/1.1/";
 our $DCTERMS_NS = "http://purl.org/dc/terms/";
 our $OPF_NS = "http://www.idpf.org/2007/opf";
 
 # This is just a variant of the DefaultXML plug-in
 use EPrints::Plugin::Import::DefaultXML;
 
 our @ISA = qw/ EPrints::Plugin::Import::DefaultXML /;
 
 sub new
 {
       my( $class, %params ) = @_;
 
       my $self = $class->SUPER::new(%params);
 
       $self->{name} = "OPF Resource";
       # Make it visible on the import menu and elsewhere
       $self->{visible} = "all";
       $self->{produce} = [ 'list/eprint', 'dataobj/eprint' ];
 
       # We shall soon be building in the functionality to recognise XML types on import, which a tag such as the following will offer.
       # $self->{schema} = "http://www.idpf.org/2007/opf"
 
       my $rc = EPrints::Utils::require_if_exists("MIME::Types");
 
       unless( $rc )
       {
               $self->{visible} = "";
               $self->{error} = "Failed to load required module MIME::Types";
       }
 
       return $self;
 }
 
 # Input File Handle Method, for when files are uploaded
 sub input_fh
 {
       my( $plugin, %opts ) = @_;
 
       my $fh = $opts{"fh"};
 
       my $xml = join "", <$fh>;
 
       my $list;
 
       if( $xml =~ /^<\?xml/ )
       {
               $list = $plugin->input_fh_xml( $xml, %opts );
       }
       else
       {
               $list = $plugin->input_fh_list( $xml, %opts );
       }
 
       $list ||= EPrints::List->new(
                       dataset => $opts{dataset},
                       session => $plugin->{session},
                       ids => [] );
 
       return $list;
 }
 
 # Handle direct XML input
 sub input_fh_xml
 {
       my( $plugin, $xml, %opts ) = @_;
 
       my $doc = EPrints::XML::parse_xml_string( $xml );
 
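       my $dataobj = $plugin->xml_to_dataobj( $opts{dataset}, $doc->documentElement );
 
       EPrints::XML::dispose( $doc );
 
       # xml_to_dataobj may return undef, so only list an id if we got an object
       my @ids = defined $dataobj ? ( $dataobj->get_id ) : ();
 
       return EPrints::List->new(
                       dataset => $opts{dataset},
                       session => $plugin->{session},
                       ids => \@ids );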
 }
 
 # Fetch a list of record URLs from a URL and import each record in turn
 sub input_fh_list
 {
       my( $plugin, $url, %opts ) = @_;
 
       $url =~ s/\s+//g;
 
       my $tmpfile = File::Temp->new;
 
       my $r = EPrints::Utils::wget( $plugin->{session}, $url, $tmpfile );
       seek($tmpfile,0,0);
 
       if( $r->is_error )
       {
               $plugin->error( "Error reading resource from $url: ".$r->code." ".$r->message );
               return;
       }
 
       my @ids;
 
       while(my $url = <$tmpfile>)
       {
               $url =~ s/\s+//g;
               next unless $url =~ /^http/;
 
               my $doc;
               eval { $doc = EPrints::XML::parse_url( $url ) };
               if( $@ )
               {
                       $plugin->warning( "Error parsing: $url\n" );
                       # $doc is undefined at this point, so skip this record
                       next;
               }
 
               my $dataobj = $plugin->xml_to_dataobj( $opts{dataset}, $doc->documentElement );
 
               EPrints::XML::dispose( $doc );
 
               if( defined $dataobj )
               {
                       push @ids, $dataobj->get_id;
               }
       }
         
       return EPrints::List->new(
                       dataset => $opts{dataset},
                       session => $plugin->{session},
                       ids => \@ids );
 }
 # Translate this XML into an EPrint
 sub xml_to_dataobj
 {
       # $xml is the OPF package element
       my( $plugin, $dataset, $xml ) = @_;
 
       my $session = $plugin->{session};
 
       # Locate the metadata element, giving up if the record has none
       my $metadata = $xml->getElementsByTagNameNS( $OPF_NS, "metadata" )->[0];
       return unless defined $metadata;
 
       # Load the DC plugin
       my $dc_plugin = $session->plugin( "Import::XSLT::DC",
               processor => $plugin->{processor},
               dataset => $dataset,
       );
 
       # Spew the metadata element to a temp file
       my $tmpfile2 = File::Temp->new;
       print $tmpfile2 $metadata->toString();
       seek($tmpfile2,0,0);
 
       # Parse the file using the plug-in to get back a list of eprints
       my $list = $dc_plugin->input_fh( fh => $tmpfile2, dataset => $dataset );
 
       my( $eprint ) = $list->get_records( 0, 1 );
 
       return $eprint;
 }
 
 1;
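
The hand-off in xml_to_dataobj relies on a small file-handle trick: write the serialised element to a temporary file, rewind it, and pass the handle to the next plug-in's input_fh. The following is a minimal sketch of that pattern on its own, using only core Perl. The functions read_all and hand_over are hypothetical stand-ins (read_all plays the role of the receiving input_fh) and are not part of EPrints.

```perl
use strict;
use warnings;
use File::Temp;

# Hypothetical stand-in for a plug-in's input_fh method: reads everything
# from the file handle it is given.
sub read_all
{
    my( %opts ) = @_;

    my $fh = $opts{fh};

    return join "", <$fh>;
}

# Write a string to a temporary file and pass the rewound handle on.
sub hand_over
{
    my( $content ) = @_;

    my $tmpfile = File::Temp->new;   # removed automatically when it goes out of scope
    print $tmpfile $content;
    seek( $tmpfile, 0, 0 );          # rewind, otherwise the reader starts at end-of-file

    return read_all( fh => $tmpfile );
}

print hand_over( "<metadata/>" ), "\n";
```

Forgetting the seek is the usual mistake with this pattern: the reader then sees an empty file even though the data was written.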

XSLT/DC.xsl

XSLT is really quite a nice way of parsing data, and it is well documented, so I shouldn't have to say much about how this works here.

 <?xml version="1.0"?>
 
 <!-- DC Transformation -->
 
 <xsl:stylesheet
       version="1.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       ept:name="DC"
       ept:visible="all"
       ept:advertise="1"
       ept:sourceNamespace="http://purl.org/dc/elements/1.1/"
       ept:targetNamespace="http://eprints.org/ep2/data/2.0"
       ept:produce="list/eprint"
       xmlns:ept="http://eprints.org/ep2/xslt/1.0"
       xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:opf="http://www.idpf.org/2007/opf"
       xmlns="http://eprints.org/ep2/data/2.0"
 >
 
 <xsl:output method="xml" indent="yes" encoding="utf-8"/>
 <xsl:template match="/">
       <eprints>
       <eprint>
       <eprint_status>inbox</eprint_status>
       <creators>
       <xsl:for-each select="//dc:creator">
               <item><xsl:value-of select="." /></item>
       </xsl:for-each>
       </creators>
       <corp_authors>
       <xsl:for-each select="//dc:source">
               <item><xsl:value-of select="." /></item>
       </xsl:for-each>
       </corp_authors>
       <subjects>
       <xsl:for-each select="//dc:subject">
               <item><xsl:value-of select="." /></item>
       </xsl:for-each>
       </subjects>
       <xsl:apply-templates select="./*" />
       </eprint>
       </eprints>
 </xsl:template>
 
 <xsl:template match="dc:title">
       <title><xsl:value-of select="." /></title>
 </xsl:template>
 
 <xsl:template match="dc:publisher">
       <publisher><xsl:value-of select="." /></publisher>
 </xsl:template>
 
 <xsl:template match="dc:rights">
       <copyright_holders><xsl:value-of select="." /></copyright_holders>
 </xsl:template>
 
 <xsl:template match="dc:date">
       <date><xsl:value-of select="." /></date>
 </xsl:template>
 
 <xsl:template match="dc:identifier">
       <id_number><xsl:value-of select="." /></id_number>
 </xsl:template>
 
 <!--Ignored at this level-->
 <xsl:template match="dc:language|dc:subject|dc:source|dc:creator"/>
 
 </xsl:stylesheet>
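
For quick reference, the mapping the stylesheet performs can be summarised as a lookup table. The sketch below simply restates the templates above in Perl. The names %DC_TO_EPRINT and eprint_field_for are hypothetical and exist only for illustration; the plug-in itself does not use them.

```perl
use strict;
use warnings;

# Dublin Core element => EPrints field, as read off the stylesheet above.
my %DC_TO_EPRINT = (
    title      => 'title',
    publisher  => 'publisher',
    rights     => 'copyright_holders',
    date       => 'date',
    identifier => 'id_number',
    creator    => 'creators',      # multiple, one <item> per dc:creator
    source     => 'corp_authors',  # multiple, one <item> per dc:source
    subject    => 'subjects',      # multiple, one <item> per dc:subject
);

# Hypothetical helper: returns undef for elements the stylesheet drops,
# such as dc:language.
sub eprint_field_for
{
    my( $dc_element ) = @_;

    return $DC_TO_EPRINT{$dc_element};
}

print eprint_field_for( 'rights' ), "\n";
```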