Migration

From EPrints Documentation

Latest revision as of 14:44, 18 May 2012

This page covers how to migrate from EPrints 2 to EPrints 3.

Migration Toolkit

The migration toolkit, available from http://files.eprints.org/, does quite a bit of the heavy lifting. It is intended to help configure an EP3 archive to have the same files, eprint types etc. as an EPrints 2 repository and then copy the data over.

Release 1.0-beta-1 should be a big improvement over 0.2 but it still doesn't do everything.

Installation

Backup

First of all make sure your EPrints 2 repository is backed up, just in case things don't go to plan. You already back it up daily anyway, right...?

Mtoolkit

Un-tar the package on the same machine as your EPrints 2 repository.

If your EPrints 2 was not installed in /opt/eprints2 then you'll need to modify the first line of the two .pl scripts in the toolkit.
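
The first-line change can be scripted rather than done in an editor. This is a minimal sketch, assuming the install path /opt/eprints2 appears literally on the first line of each script; the function name retarget_first_line and the example path /usr/local/eprints2 are illustrative only, so check the actual first line of your copy of the toolkit before relying on it.

```shell
# Rewrite the EPrints 2 install path on the first line of a script.
# Usage: retarget_first_line SCRIPT OLD_PATH NEW_PATH
# Assumption: OLD_PATH appears literally on line 1 and contains no
# characters that are special to sed (plain paths such as /opt/eprints2 are fine).
retarget_first_line() {
    script=$1
    old=$2
    new=$3
    # Only line 1 is edited; "|" is the sed delimiter so the paths may contain "/".
    # The result is copied back with cat so the script keeps its permissions.
    sed "1s|$old|$new|" "$script" > "$script.tmp" &&
        cat "$script.tmp" > "$script" &&
        rm -f "$script.tmp"
}
```

For example, retarget_first_line mkconfig.pl /opt/eprints2 /usr/local/eprints2, and the same again for export3data.pl.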

EPrints 3

Minimum version required: 3.0.2 (This version introduces some very small options and bugfixes aimed at migration).

Also, get an EPrints 3 server set up. This can be on the same machine or on a different one. On the same machine you'll need a separate instance of Apache, as ep2 and ep3 can't run under the same server at the same time: put it on port 8080 for now and install it in another directory using the --prefix option to configure (see http://httpd.apache.org/docs/2.0/install.html for instructions). Get a repository created (probably with the same ID as your ep2 repo, although that's not essential). The database will need a different name or you'll get in an utter mess.

mkconfig.pl

This tool takes the id of an EPrints 2 repository and generates a number of EPrints 3 config files. Copy these files into the cfg dir of your EPrints 3 repository. It also creates a file called migration_notes.txt with some helpful comments about anything it's messed with.

Get your (empty) EP3 repository up and running using these configuration files.

export3data.pl

This script exports the data from your EPrints 2 repository in a format which can be imported by EPrints 3.

There have been some problems with exporting non-Latin characters (e.g. letters with accents). If you have any problems, these can probably be solved by editing the export3data.pl script and adding the following line (put it just under the first line).

 use encoding 'utf8';

To export the data do the following:

 export3data.pl ARCHIVEID eprints > eprints.xml
 export3data.pl ARCHIVEID users > users.xml
 export3data.pl ARCHIVEID subjects > subjects.xml

eprints.xml references the full paths of the files in EPrints 2. If your EPrints 3 is on a different machine you'll need to either make sure the files are at the same paths on the new machine or do a big search-and-replace on eprints.xml!

If the script has any problems, run it with the --skiplog option:
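
The search-and-replace can be done with sed. This is a sketch only: it assumes the old location appears in eprints.xml as a literal path prefix, and the function name rewrite_paths and the example paths below are made up for illustration.

```shell
# Replace every occurrence of an old path prefix with a new one.
# Usage: rewrite_paths OLD_PREFIX NEW_PREFIX < eprints.xml > eprints_moved.xml
# Assumption: neither prefix contains "|" or characters special to sed.
rewrite_paths() {
    # "|" is the sed delimiter so the prefixes may contain "/".
    sed "s|$1|$2|g"
}
```

For example, rewrite_paths /opt/eprints2/archives/ARCHIVEID /data/ep2files < eprints.xml > eprints_moved.xml. Writing to a new file keeps the original export intact, and running grep -c with the old prefix on both files shows how many lines were affected and whether any were missed.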

 export3data.pl --skiplog errors.txt ARCHIVEID eprints > eprints.xml

Any items with problems will be skipped, but their ids will be recorded in the errors.txt file. Export these by hand if they are important.

Importing

EPrints 3.0.2 no longer needs the hacks which were required for mtoolkit 0.2.

Empty out any test data

To erase the current data in your EP3 repository use:

bin/epadmin erase_data ARCHIVEID

Import the data

To import the subjects and users do:

/opt/eprints3/bin/import_subjects --verbose --force --xml ARCHIVEID subjects.xml
/opt/eprints3/bin/import --verbose --migration ARCHIVEID user XML users.xml

If something goes wrong with subjects or users, use epadmin erase_data to empty the database and start again.

To import the eprints do:

/opt/eprints3/bin/import --verbose --migration ARCHIVEID eprint XML eprints.xml

If something goes wrong with importing the eprints, use epadmin erase_eprints to erase just the eprints data, so you don't need to redo subjects and users.
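
Since the order matters (subjects and users first, then eprints), it can help to write the three commands out for your archive id before running anything. The sketch below only prints them; import_plan is a hypothetical helper, not part of EPrints, and the default of /opt/eprints3/bin is taken from the commands above.

```shell
# Print the import commands in the order given above. Nothing is executed.
# Usage: import_plan ARCHIVEID [BINDIR]
import_plan() {
    archiveid=$1
    bindir=${2:-/opt/eprints3/bin}   # override if EPrints 3 lives elsewhere
    echo "$bindir/import_subjects --verbose --force --xml $archiveid subjects.xml"
    echo "$bindir/import --verbose --migration $archiveid user XML users.xml"
    echo "$bindir/import --verbose --migration $archiveid eprint XML eprints.xml"
}
```

One way to use it is import_plan ARCHIVEID > run_import.sh followed by sh -e run_import.sh, so that the run stops at the first command that fails instead of importing eprints on top of a failed user import.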

The --migration option tells the importer to:

  • skip are-you-sure? messages.
  • use the eprintid and userid from the XML rather than assigning them.
  • use the "datestamp" from the XML rather than assigning it.
  • load files from the local file system (normally this would be a security hole).

You may encounter some issues with badly formed XML. This is due to incorrectly encoded data creeping into your database. It should all be UTF-8, but earlier versions of EPrints didn't always check. If your EPrints 2 server is running Perl 5.8 you can install the Perl module Encode, which will clean up your data. On our system, EPrints 2 was running on a machine with an older version of Perl and we didn't want to risk upgrading.

Finishing up after using mtoolkit

Some things we can't easily add to the mtoolkit (those involving Perl code). The XML files we could add in theory, but we've made a decision to release 1.0 with the current features, rather than delay it for months to make it perfect.

You will probably still want to tweak some of the following things by hand, depending on how much you customised EPrints 2:

  • the template
  • the workflow (EPrints 3 offers some nice features; look in lib/defaultcfg/workflows/ for an idea of what you can do)
  • the static pages (.xpage)
  • the citation files
  • the /view/ browsing configuration
  • the search configuration
  • any custom render routines
  • the render eprint method (eprint_render.pl)
  • any custom document security options
  • any custom validation options
  • etc.

Feel free to add tips on the wiki, linked from this section.


Known bugs in version 1.0 of toolkit / importing into EPrints 3.0.2

Documents with subdirectories fail to import

FIX: do them by hand at the end.

Warning messages about "hideemail"

hideemail was introduced in a version of EPrints 2 (I forget which). Earlier repositories may not have this field. Some of the EPrints 3 default config files assume it exists (user_fields_default.pl and user_render.pl).

FIX 1: Don't worry about it.

FIX 2: Before importing users.xml, add the hideemail field back into user_fields.pl:

         {
           'name' => 'hideemail',
           'input_style' => 'radio',
           'type' => 'boolean',
         },

Error missing field: X

The default EPrints 3 config may reference a field that was not imported. If so, you can almost always just remove the offending section of configuration. Examples: searches, citations, views.

Problems with bad characters in eprints.xml

This is not tested, but the following should clean it up by dropping any bytes which are not valid UTF-8:

iconv -c eprints.xml --output=eprints_cleaned.xml -f utf-8 -t utf-8
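
Because iconv -c silently throws data away, it may be worth checking how much it would remove before trusting the cleaned file. A small sketch, assuming an iconv that accepts the same -c, -f and -t options used above; dropped_bytes is a made-up helper name.

```shell
# Report how many bytes "iconv -c" would drop from a file.
# Usage: dropped_bytes FILE
dropped_bytes() {
    before=$(wc -c < "$1")
    # iconv can exit non-zero when it discards input, so its status is ignored.
    after=$( { iconv -c -f utf-8 -t utf-8 "$1" || true; } | wc -c)
    echo $((before - after))
}
```

A result of 0 means the file was already valid UTF-8 and the bad characters are some other kind of problem. A large number suggests the data is in a different encoding altogether (Latin-1, say), in which case stripping would lose real text.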

Warning about Pagerange

Argument "" isn't numeric in addition (+) at
 /opt/eprints3/perl_lib/EPrints/MetaField/Pagerange.pm line 182.

This warning is caused by having non-numeric data in the pagerange field, e.g. "iii-xi".

FIX: Don't worry about it.

Can't import files which contain "/"

e.g. if your document had index.html and images/dia.jpg, the second file is the one affected.

FIX: Make a note of the offenders, and just add those documents by hand.
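
To find the offenders up front, you can pull the file names out of the export. This sketch assumes export3data.pl writes each document file name in a <filename> element on a line of its own, holding a name relative to the document; if your export puts full paths in that element, every entry will match and this won't help. list_nested_filenames is an illustrative name.

```shell
# List file names in the export which contain a "/".
# Usage: list_nested_filenames < eprints.xml
list_nested_filenames() {
    # Extract the text of each <filename> element, then keep only names with a "/".
    sed -n 's|.*<filename>\(.*\)</filename>.*|\1|p' | grep '/'
}
```

The output gives you the list of files to add by hand once the main import has finished.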

FIX 2: Bug Chris to fix this in the final release of 3.0.2 (it's not in beta-1).