Files/EThOS webservice download tool

From EPrints Documentation
Revision as of 09:15, 10 July 2012 by Libjlrs (Talk | contribs)

The British Library EThOS service has a SOAP-based webservice that allows institutions to pull digitized theses back from its collection.

Code is available at http://files.eprints.org/778/ to pull the theses back from EThOS and to populate records that have been harvested by EThOS (via OAI-PMH) with their EThOS IDs.

A presentation from OR2012 is also available on the wiki: OR2012_EThOS_presentation.

The EPrints script uses the following Perl modules:

  • SOAP::Lite (to make the calls to the webservice)
  • LWP (to download the Zip files)
  • Archive::Zip (to extract the files)
  • Time::Local (to generate timestamps for the webservice)
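The extraction step can be illustrated with a minimal sketch. It is written in Python for illustration only (the actual script is Perl and uses the modules listed above), and it covers just the unpacking of an already-downloaded Zip file. The function name extract_thesis_zip and the idea of returning a list of problem notes are assumptions, not part of the distributed code.

```python
import io
import zipfile
from pathlib import Path
from typing import List, Tuple


def extract_thesis_zip(zip_bytes: bytes, target_dir: Path) -> Tuple[List[str], List[str]]:
    """Unpack a downloaded Zip (held in memory) into target_dir.

    Returns (names of extracted files, notes about any problems).
    Hypothetical helper: the real script does this with Archive::Zip.
    """
    extracted: List[str] = []
    problems: List[str] = []
    try:
        archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
    except zipfile.BadZipFile as exc:
        # Report the failure instead of raising, so the caller can record it.
        problems.append(f"Could not open Zip file: {exc}")
        return extracted, problems
    with archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories are created implicitly on extract
            archive.extract(info, str(target_dir))
            extracted.append(info.filename)
    return extracted, problems
```

Returning the problem notes rather than raising means a caller can store them alongside the record instead of aborting the whole run.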

The code consists of two parts:

  • a bin script (goes in ~/bin/)
  • a config file (goes in ~/archives/ARCHIVEID/cfg/cfg.d/)

Run ~/bin/ethos to see usage.

The following rules are used:

1) If an incoming thesis contains an institutionalReference (and has therefore been harvested):

  • A search is done on the reference
    • If the eprint doesn't exist, no import is done.
    • If the eprint exists and doesn't have an id_number, the id_number is added.
    • If the eprint exists and the id_number matches the ethosId no import is done.
    • If the eprint exists and the id_number doesn't match the ethosId, an error is reported.

2) If the incoming thesis does not have an institutionalReference, a search for the ethosId in the id_number field is conducted.

  • If an eprint exists with that ethosId, no import is done.
  • If the ethosId is not found, a new eprint is created. Files are downloaded, and added to the eprint. Any issues with the download are recorded in the 'suggestions' field.
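The two rules above can be sketched as a single decision function. This is a Python illustration only, not the Perl code from the script. The function name decide_action, the action labels, and the two lookup dictionaries (which stand in for the repository searches) are all assumptions made for the example.

```python
from typing import Dict, Optional


def decide_action(institutional_reference: Optional[str],
                  ethos_id: str,
                  eprints_by_reference: Dict[str, dict],
                  eprints_by_id_number: Dict[str, dict]) -> str:
    """Return the action the rules would take for one incoming thesis.

    The two dictionaries are hypothetical stand-ins for searching the
    repository by institutional reference and by id_number.
    """
    if institutional_reference:
        # Rule 1: the thesis was harvested, so search on its reference.
        eprint = eprints_by_reference.get(institutional_reference)
        if eprint is None:
            return "skip"            # eprint doesn't exist: no import
        current = eprint.get("id_number")
        if not current:
            return "add_id_number"   # eprint exists without an id_number
        if current == ethos_id:
            return "skip"            # already linked: no import
        return "error"               # id_number and ethosId disagree
    # Rule 2: no institutionalReference, so search id_number for the ethosId.
    if ethos_id in eprints_by_id_number:
        return "skip"                # already imported
    return "create"                  # new eprint, then download the files
```

For example, a harvested thesis whose eprint has no id_number yet yields "add_id_number", while a thesis with no institutionalReference and an unknown ethosId yields "create".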

The following fields are set in the EPrint:

  • userid
  • title
  • date
  • abstract
  • id_number
  • keywords
  • creators_name
  • thesis_type
  • suggestions

The scripts were written for the White Rose Etheses Online setup. This means the config file may seem a bit warped in its structure - it was designed to cope with three institutions.

EThOS Identifiers

EThOS identifiers are of the form uk.bl.ethos.xxxxx. If you have EThOS identifiers stored in your system in a different format, you will need to tweak the code to cope with your way of storing them. If you have any questions or problems with the script, email the tech-list!
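As an illustration of the kind of tweak this might involve, here is a minimal Python sketch (the script itself is Perl) that converts a locally stored identifier to the uk.bl.ethos.xxxxx form. It assumes that the xxxxx part is numeric and that the local format is either the bare number or the full identifier. Both are assumptions, and the function name normalise_ethos_id is hypothetical.

```python
import re

ETHOS_PREFIX = "uk.bl.ethos."

# Assumption: the part after the prefix consists of digits only.
_ETHOS_ID = re.compile(r"^(?:uk\.bl\.ethos\.)?(\d+)$", re.IGNORECASE)


def normalise_ethos_id(stored: str) -> str:
    """Return stored in the form uk.bl.ethos.<digits>.

    Accepts the full identifier or a bare number; anything else raises
    ValueError so that unexpected formats are noticed rather than guessed.
    """
    match = _ETHOS_ID.match(stored.strip())
    if match is None:
        raise ValueError(f"Unrecognised EThOS identifier: {stored!r}")
    return ETHOS_PREFIX + match.group(1)
```

Rejecting unrecognised values outright, rather than passing them through, keeps a mismatched id_number from being mistaken for a thesis that has not yet been imported.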

Downloading EThOS records in practice

The process of dealing with etheses is different at each of the three White Rose institutions. In general though, theses coming from EThOS follow a different route, and are handled by a different set of people, from those deposited by students as part of their studies. We have created an 'EThOS import' user for each site, and provided a custom workflow for these users. There are fewer 'required' metadata elements for EThOS theses - we don't have a record of any 'supervisor email address' for a 1956 thesis!