OR2012 EThOS presentation
The following notes form the basis of the presentation at OR2012. They cover experiences with the EThOS webservice, developing a connector for it, and how to use that connector.
About the webservice
The webservice uses a SOAP/WSDL architecture. The WSDL is a machine-readable file that describes how to interrogate the EThOS service. The XSD describes which elements you need to send to the service, and what data you'll get back.
For EPrints, you can use the SOAP::Lite Perl module to format data and send it to the webservice.
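To make the request format concrete, here is a minimal sketch of a SOAP 1.1 envelope built with Python's standard library rather than SOAP::Lite. Only the SOAP envelope namespace is standard. The service namespace, the operation name (getRecords) and the parameter name (institution) are placeholders invented for illustration; the real names have to be read from the EThOS WSDL and XSD.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"

# Hypothetical values: the real ones are defined in the EThOS WSDL and XSD.
SERVICE_NS = "http://example.org/ethos"
OPERATION = "getRecords"


def qualified(namespace, local_name):
    """Return an ElementTree tag in {namespace}local_name form."""
    return "{%s}%s" % (namespace, local_name)


def build_envelope(operation, params, service_ns=SERVICE_NS):
    """Build a SOAP 1.1 request envelope and return it as a string."""
    envelope = ET.Element(qualified(SOAP_ENV, "Envelope"))
    body = ET.SubElement(envelope, qualified(SOAP_ENV, "Body"))
    request = ET.SubElement(body, qualified(service_ns, operation))
    # Each parameter becomes a child element of the operation element.
    for name, value in params.items():
        child = ET.SubElement(request, qualified(service_ns, name))
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


xml_request = build_envelope(OPERATION, {"institution": "Example University"})
print(xml_request)
```

A SOAP library such as SOAP::Lite generates this envelope for you from the WSDL; the sketch only shows what ends up on the wire.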
About the code
The code available at http://files.eprints.org/778/ allows you to query the webservice and download theses for your institution.
The code consists of two parts:
- a bin script that runs from the command line.
- a config file. The format may seem a little convoluted: it was written to cope with the three-institution consortium model of the White Rose Etheses Online archive.
The configuration allows you to specify defaults for EThOS imports on top of the normal record defaults. It also allows you to map and process some data elements so that they fit the EPrints model, e.g. thesis_type (a controlled list in EPrints, a free-text element in EThOS).
The code is designed to be run periodically (e.g. monthly) and will attempt to download theses where possible/necessary.
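The idea of import defaults plus a mapping onto a controlled list can be sketched as follows. This is an illustration in Python, not the format of the actual config file. The default values, the mapping entries and the function names are all assumptions.

```python
# Hypothetical defaults applied to every EThOS import,
# on top of the normal record defaults.
ETHOS_DEFAULTS = {"type": "thesis", "ispublished": "unpub"}

# Hypothetical mapping from free-text EThOS values to the
# controlled thesis_type list used by the local repository.
THESIS_TYPE_MAP = {
    "phd": "phd",
    "ph.d.": "phd",
    "doctor of philosophy": "phd",
    "mphil": "mphil",
    "m.phil.": "mphil",
}


def normalise_thesis_type(raw, fallback="other"):
    """Map a free-text thesis type onto the controlled list."""
    # Lower-case and collapse whitespace before looking the value up.
    key = " ".join(raw.lower().split())
    return THESIS_TYPE_MAP.get(key, fallback)


def apply_ethos_config(record):
    """Merge import defaults into a record and normalise thesis_type."""
    merged = dict(ETHOS_DEFAULTS)
    merged.update(record)  # values from the record win over defaults
    if "thesis_type" in merged:
        merged["thesis_type"] = normalise_thesis_type(merged["thesis_type"])
    return merged


example = apply_ethos_config({"title": "A thesis", "thesis_type": "  Ph.D. "})
print(example["thesis_type"])
```

Values that are not in the mapping fall back to "other" here, so an editor can correct them by hand instead of the import failing.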
A word on Identifiers
Data harvested from your repository will include an institutionReference element, taken from the OAI-PMH interface.
EThOS records are issued with an EThOSid. This isn't directly stated in the webservice, but is the eprintId prefixed with 'uk.bl.ethos.'. By storing this identifier in the EPrints record, future runs of the download tool can see that the record has already been ingested. It should also help the EThOS service match records.
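The identifier rule and the "already ingested" check can be sketched like this. The 'uk.bl.ethos.' prefix and the eprintId element come from the notes above. The function names and the shape of the record dictionaries are assumptions made for the example.

```python
ETHOS_PREFIX = "uk.bl.ethos."


def ethos_id(eprint_id):
    """Build the EThOS identifier from the webservice's eprintId."""
    return ETHOS_PREFIX + str(eprint_id)


def records_to_import(webservice_records, stored_ids):
    """Return only the records whose EThOS id is not stored locally yet.

    webservice_records: list of dicts, each with an 'eprintId' key
    stored_ids: EThOS ids already held in local repository records
    """
    known = set(stored_ids)
    return [
        record
        for record in webservice_records
        if ethos_id(record["eprintId"]) not in known
    ]


new_records = records_to_import(
    [{"eprintId": 1001}, {"eprintId": 1002}],
    ["uk.bl.ethos.1001"],
)
print(new_records)
```

On a periodic run, only the record with eprintId 1002 would be downloaded, because the identifier for 1001 is already stored.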
Downloading in anger
For the White Rose Etheses Online repository, we created a user, a user type and a custom workflow for importing the EThOS records.
This allows us to deal with the imported records as a separate task from incoming student-submitted etheses. The workflow for our editors is minimal. There is one screen for metadata, and one screen for the files.
The metadata screen was specified by the team of people who would be working with it. Metadata elements that are required are at the top; those that would probably not get completed are collapsed at the bottom of the screen (we don't have many email addresses recorded for 1937 theses!).
Issues with the plugin
The plugin code was initially written to download the initial batch of theses being digitised by the BL. It was designed to get the record back to the local repository quickly.
During early testing of Sheffield theses, the data available in the webservice record was the same as that in the UKETD_DC record. For this reason the connector doesn't use the full UKETD_DC record (this is bad. Sorry!).
An updated script will be published that makes full use of the UKETD_DC record.
Issues with the Webservice
- If you try to download a lot of records, the connection to the webservice may time out.
- Author names are messy, and don't match those available via the EThOS website. This is being investigated by the EThOS crew.
To do
- Make an EPrints Bazaar connector.
- Understand what data models other EPrints repositories have (what IDs do they use?).
- Make full use of the UKETD_DC data from the webservice.
- Make sure data exposed over OAI-PMH is available in UKETD_DC (and easily harvestable by EThOS!)