Import and Export Plug-ins
Latest revision as of 14:03, 20 July 2011
Import and Export plug-ins are all about taking an object in some external format, de-serialising it into an eprint (or other object in the EPrints system) and re-serialising it back into the external format being supported.
We would typically advocate that every import plug-in should have a corresponding export plug-in and vice versa.
Enable the Plug-ins
Create the package directory that will contain the package's configuration file:
$ mkdir -p lib/epm/io_exercise/cfg/cfg.d
Create a configuration file that enables the plugins in this package - this file is copied into the repository when the package is enabled:
$ gedit lib/epm/io_exercise/cfg/cfg.d/io_exercise.pl

$c->{plugins}{"Import::OPFXML"}{params}{disable} = 0;
$c->{plugins}{"Import::XSLT::DC"}{params}{disable} = 0;
Input Data
We are going to attempt to import the following XML record into our system. If you know the format of this record, all the better; otherwise we'll leave it as a surprise for now.

<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="EPB-UUID" version="2.0">
  <metadata xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>The Adventures of Sherlock Holmes</dc:title>
    <dc:creator opf:role="aut" opf:file-as="Doyle, Sir Arthur Conan">Sir Arthur Conan Doyle</dc:creator>
    <dc:publisher>epubBooks (www.epubbooks.com)</dc:publisher>
    <dc:date opf:event="epub-publication">2010-06-15</dc:date>
    <dc:subject>Crime/Detective</dc:subject>
    <dc:subject>Short Stories</dc:subject>
    <dc:source>Project Gutenberg</dc:source>
    <dc:rights>
      Provided for free by epubBooks.com. Not for commercial use.
      This EPUB eBook is released under a Creative Commons (BY-NC-ND/3.0) Licence.
      Source text and images are in the Public Domain.
    </dc:rights>
    <dc:identifier id="EPB-UUID">urn:uuid:FF946905-6C08-1014-90EA-E81F8523F0DC</dc:identifier>
    <dc:language>en-gb</dc:language>
  </metadata>
  <manifest>
  </manifest>
  <spine>
  </spine>
</package>
Importing
To import this record we need a fully featured import plug-in which is capable of handling both file and raw XML input. We are also going to use an XSLT plug-in to make the process even easier.
The Files
We need the following:
- lib/plugins/EPrints/Plugin/Import/OPFXML.pm
- lib/plugins/EPrints/Plugin/Import/XSLT/DC.xsl
OPFXML.pm
This will be our master import plug-in to handle the entire chunk of XML. We will then use the DC plug-in to map the metadata into an eprint we can use later.
Since this is the master plug-in, it requires quite a sizeable amount of code:

package EPrints::Plugin::Import::OPFXML;

use strict;
use File::Temp;

# Declare Namespaces
our $DC_NS = "http://purl.org/dc/elements/1.1/";
our $DCTERMS_NS = "http://purl.org/dc/terms/";
our $OPF_NS = "http://www.idpf.org/2007/opf";

# This is just a variant of the DefaultXML plug-in
use EPrints::Plugin::Import::DefaultXML;
our @ISA = qw/ EPrints::Plugin::Import::DefaultXML /;

sub new
{
	my( $class, %params ) = @_;

	my $self = $class->SUPER::new(%params);

	$self->{name} = "OPF Resource";
	# Make it visible on the import menu and elsewhere
	$self->{visible} = "all";
	$self->{produce} = [ 'list/eprint', 'dataobj/eprint' ];

	# Functionality to recognise XML types on import by recognising the base namespace,
	# works with the sword packaging format, dc:conformsTo or similar.
	$self->{xmlns} = "http://www.idpf.org/2007/opf";

	my $rc = EPrints::Utils::require_if_exists("MIME::Types");
	unless( $rc )
	{
		$self->{visible} = "";
		$self->{error} = "Failed to load required module MIME::Types";
	}

	return $self;
}

# Input File Handle Method, for when files are uploaded
sub input_fh
{
	my( $plugin, %opts ) = @_;

	my $fh = $opts{"fh"};

	my $xml = join "", <$fh>;

	my $list;

	# Input starting with an XML declaration is raw XML, anything else is a list of URLs
	if( $xml =~ /^<\?xml/ )
	{
		$list = $plugin->input_fh_xml( $xml, %opts );
	}
	else
	{
		$list = $plugin->input_fh_list( $xml, %opts );
	}

	$list ||= EPrints::List->new(
		dataset => $opts{dataset},
		session => $plugin->{session},
		ids => [] );

	return $list;
}

# Handle direct XML input
sub input_fh_xml
{
	my( $plugin, $xml, %opts ) = @_;

	my $doc = EPrints::XML::parse_xml_string( $xml );

	my $dataobj = $plugin->xml_to_dataobj( $opts{dataset}, $doc->documentElement );

	EPrints::XML::dispose( $doc );

	return EPrints::List->new(
		dataset => $opts{dataset},
		session => $plugin->{session},
		ids => [defined($dataobj) ? $dataobj->get_id : ()] );
}

# Go grab input from a URL
sub input_fh_list
{
	my( $plugin, $url, %opts ) = @_;

	$url =~ s/\s+//g;

	my $tmpfile = File::Temp->new;

	my $r = EPrints::Utils::wget( $plugin->{session}, $url, $tmpfile );
	seek($tmpfile,0,0);

	if( $r->is_error )
	{
		$plugin->error( "Error reading resource from $url: ".$r->code." ".$r->message );
		return;
	}

	my @ids;

	while(my $url = <$tmpfile>)
	{
		$url =~ s/\s+//g;
		next unless $url =~ /^http/;

		my $doc;
		eval { $doc = EPrints::XML::parse_url( $url ) };
		if( $@ )
		{
			$plugin->warning( "Error parsing: $url\n" );
			# Skip this URL, there is no document to convert
			next;
		}

		my $dataobj = $plugin->xml_to_dataobj( $opts{dataset}, $doc->documentElement );
		EPrints::XML::dispose( $doc );
		if( defined $dataobj )
		{
			push @ids, $dataobj->get_id;
		}
	}

	return EPrints::List->new(
		dataset => $opts{dataset},
		session => $plugin->{session},
		ids => \@ids );
}

# Translate this XML into an EPrint
sub xml_to_dataobj
{
	# $xml is the OPF package element
	my( $plugin, $dataset, $xml ) = @_;

	my $session = $plugin->{session};

	# Locate the metadata element
	my $metadata = $xml->getElementsByTagNameNS( $OPF_NS, "metadata" )->[0];

	# Load the DC plugin
	my $dc_plugin = $session->plugin( "Import::XSLT::DC",
		processor => $plugin->{processor},
		dataset => $dataset,
	);

	$dc_plugin->{Handler} = $plugin->{Handler};
	$dc_plugin->{parse_only} = $plugin->{parse_only};

	# Spew the metadata element to a temp file
	my $tmpfile2 = File::Temp->new;
	print $tmpfile2 $metadata->toString();
	seek($tmpfile2,0,0);

	# Parse the file using the plug-in to get back a list of eprints
	my $list = $dc_plugin->input_fh( fh => $tmpfile2, dataset => $dataset );

	my( $eprint ) = $list->get_records( 0, 1 );

	return $eprint;
}

1;
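The dispatch at the top of input_fh can be tried on its own, outside EPrints. The sketch below uses a hypothetical helper name (classify_input, not part of the plug-in) and only core Perl. It shows that the test looks for an XML declaration at the very first character, so pasted XML with a leading blank line or space would be treated as a list of URLs instead.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the test used in input_fh:
# an XML declaration at the very start means raw XML,
# anything else is handled as a list of URLs.
sub classify_input
{
	my( $text ) = @_;

	return $text =~ /^<\?xml/ ? "xml" : "list";
}

print classify_input( qq{<?xml version="1.0"?><package/>} ), "\n";   # xml
print classify_input( "http://example.org/a.opf\n" ), "\n";          # list
print classify_input( qq{\n<?xml version="1.0"?><package/>} ), "\n"; # list (leading newline)
```

If the import appears to do nothing with pasted XML, checking for stray whitespace before the declaration is a cheap first step.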
XSLT/DC.xsl
Using XSLT is really quite a nice way of parsing data, and it is well documented, so I shouldn't have to say much about how this works here.

<?xml version="1.0"?>
<xsl:stylesheet
	version="1.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	ept:name="DC"
	ept:visible="all"
	ept:advertise="1"
	ept:sourceNamespace="http://purl.org/dc/elements/1.1/"
	ept:targetNamespace="http://eprints.org/ep2/data/2.0"
	ept:produce="list/eprint"
	xmlns:ept="http://eprints.org/ep2/xslt/1.0"
	xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:opf="http://www.idpf.org/2007/opf"
	xmlns="http://eprints.org/ep2/data/2.0"
>

<xsl:output method="xml" indent="yes" encoding="utf-8"/>

<xsl:template match="/">
	<eprints>
		<eprint>
			<xsl:apply-templates select="./*" />
			<eprint_status>inbox</eprint_status>
			<creators>
				<xsl:for-each select="//dc:creator">
					<item><name><family/><given><xsl:value-of select="."/></given></name></item>
				</xsl:for-each>
			</creators>
			<corp_creators>
				<xsl:for-each select="//dc:source">
					<item>
						<xsl:value-of select="." />
					</item>
				</xsl:for-each>
			</corp_creators>
			<keywords>
				<xsl:for-each select="//dc:subject">
					<xsl:value-of select="." />, 
				</xsl:for-each>
			</keywords>
		</eprint>
	</eprints>
</xsl:template>

<xsl:template match="dc:title">
	<title><xsl:value-of select="." /></title>
</xsl:template>

<xsl:template match="dc:publisher">
	<publisher><xsl:value-of select="." /></publisher>
</xsl:template>

<xsl:template match="dc:rights">
	<copyright_holders><xsl:value-of select="." /></copyright_holders>
</xsl:template>

<xsl:template match="dc:date">
	<date><xsl:value-of select="." /></date>
</xsl:template>

<xsl:template match="dc:identifier">
	<id_number><xsl:value-of select="." /></id_number>
</xsl:template>

<xsl:template match="dc:language|dc:subject|dc:source|dc:creator"/>

</xsl:stylesheet>
Testing
You can test your plug-ins by simply clicking on the "manage deposits" button in your toolbar and selecting "OPF Resource" from the drop-down list of import plug-ins. If it doesn't appear, restart your web server and check the error log!
Reminder:
apache2ctl restart && tail -f /var/log/apache2/error.log
Once you have the import screen, copy and paste the XML from the top of this exercise into it and see if it creates an EPrint for you (which you can see back at the manage deposits screen).
Exporting
Now that we have imported our EPrint from the OPF format, we need to be able to put it back into that format for exporting. This has the advantage that you can also export all other eprints in this format.
We did most of the work on the import side, so now it's your turn. Below is the basis of an Export plug-in which you can use to get started. As a clue, this export plug-in needs to be located in lib/plugins/EPrints/Plugin/Export/OPFXML.pm

package EPrints::Plugin::Export::OPFXML;

use EPrints::Plugin::Export::XMLFile;

@ISA = ( "EPrints::Plugin::Export::XMLFile" );

use strict;

sub new
{
	my( $class, %opts ) = @_;

	my $self = $class->SUPER::new( %opts );

	$self->{name} = "Open Packaging Format XML";
	# This plug-in can only output a single eprint record, no lists or other types of objects
	$self->{accept} = [ 'dataobj/eprint' ];
	$self->{visible} = "all";

	# Specify the mimetype so we can use an HTTP ACCEPT header to get this back.
	$self->{mimetype} = "application/oebps-package+xml; charset=utf-8";

	# Again specify what we are exporting
	$self->{xmlns} = "http://www.idpf.org/2007/opf";
	$self->{schemaLocation} = "http://www.idpf.org/2007/opf";

	return $self;
}

# Method which is called to output the eprint, we would have output_list if we could handle lists
sub output_dataobj
{
	my( $plugin, $dataobj ) = @_;

	my $xml = $plugin->xml_dataobj( $dataobj );

	return "<?xml version='1.0' encoding='UTF-8'?>" . EPrints::XML::to_string( $xml );
}

# The method which actually does the work
sub xml_dataobj
{
	my( $plugin, $eprint ) = @_;

	my $repo = $plugin->{repository};

	my $package = $repo->make_element("package",
		xmlns=>"http://www.idpf.org/2007/opf",
		"unique-identifier"=>"EPB-UUID",
		version=>"2.0");

	my $metadata = $repo->make_element("metadata",
		"xmlns:opf"=>"http://www.idpf.org/2007/opf",
		"xmlns:dc"=>"http://purl.org/dc/elements/1.1/");
	$package->appendChild($metadata);

	my $title = $repo->make_element("dc:title");
	$title->appendText($eprint->value("title"));
	$metadata->appendChild($title);

	if( $eprint->exists_and_set( "creators" ) )
	{
		foreach my $name ( @{ $eprint->get_value( "creators" ) } )
		{
			# given name first
			my $creator = $repo->make_element("dc:creator");
			$creator->appendText(EPrints::Utils::make_name_string( $name->{name}, 1 ));
			$metadata->appendChild($creator);
		}
	}

	return $package;
}

1;
Don't forget to add the line that enables Export::OPFXML to the package's config file.
This is nowhere near as complete as the original input above. Can you make the two match?
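Following the pattern of the two import lines in the configuration file created at the start of this exercise, the extra line would presumably look like the sketch below. The plug-in id Export::OPFXML is assumed from the package name of the export plug-in above.

```perl
# cfg.d/io_exercise.pl: enable the export plug-in alongside the two importers
$c->{plugins}{"Export::OPFXML"}{params}{disable} = 0;
```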
Testing
This is an easy one.
- Find the eprint you imported
- Click the Actions Tab
- Locate the Export options at the bottom of this tab
- Find your export plug-in (by name) and click it
Wrap it as a package
Once again, now that this is working, package the files up with a spec file and icon. Then remove them from your local install and try installing the package via the Bazaar to check that it works.
Why not see the export plug-in working via a GET request to the id URL of the eprint, http://archive.org/id/eprint/XXX, with the Accept header set to the mime-type of your desired export plug-in? This is quick proof that content negotiation works in EPrints and that it's that easy to add a new exporter.
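One way to make such a request from a script is sketched below, using HTTP::Tiny (a core Perl module). The helper names accept_options and fetch_as are hypothetical, the URL is the placeholder from the text, and sending the mime-type without its charset parameter is an assumption about what the repository will match.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the options for an HTTP::Tiny request that asks for one mime-type.
sub accept_options
{
	my( $mimetype ) = @_;

	return { headers => { Accept => $mimetype } };
}

# GET a URL with the Accept header set; returns the HTTP::Tiny response hash.
sub fetch_as
{
	my( $url, $mimetype ) = @_;

	return HTTP::Tiny->new->get( $url, accept_options( $mimetype ) );
}

# Example usage (replace the host and XXX with your repository and eprint id):
# my $r = fetch_as( "http://archive.org/id/eprint/XXX", "application/oebps-package+xml" );
# print $r->{success} ? $r->{content} : "Request failed: $r->{status} $r->{reason}\n";
```

If the response body is the OPF package XML rather than the HTML abstract page, content negotiation has selected your export plug-in.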
Extension Exercises
This whole exercise has been the beginning of importing the epub format into EPrints!
If you are feeling keen see if you can scope out or develop this further to import more of the format, maybe some of the chapters etc.
The full epub for this book is available at http://www.epubbooks.com/book/378/adventures-of-sherlock-holmes and you have been importing a small section of the epb.opf file, which is listed in META-INF/container.xml.
epub is a format we are keen to see EPrints support, and there may be a bounty available for the first person to develop a complete Bazaar Package to support it.