Making a fossilised repository
(This work requires access to the database & original file-system)
The rough plan is to pull all of the visible pages in the repository and make them available as a static copy: no dynamic stuff.
This process assumes you have access to the underlying database, and the file-store for the repo.
Assume:
- you're working as a non-root user on a workstation
- your repo is accessible [to you] at http://my.repo/
- your web server will have a document-root of ~/www/my.repo
- you know how to set a document-root on your web server
- you'll have to do some tidying up (finding images, etc) yourself
Grab a copy of the html pages
mkdir ~/www
cd ~/www
wget --local-encoding=UTF-8 --remote-encoding=UTF-8 --no-cache --mirror -nc -k http://my.repo/
Grab a copy of all the abstract pages
You need to check the database: the last eprintid in the eprints table is the number you need to count up to.
cd my.repo
for id in {1..12345} ; do wget --local-encoding=UTF-8 --remote-encoding=UTF-8 --no-cache -k http://my.repo/$id ; done
[(a) note the lack of the --mirror option, and (b) this *will* take quite a while... but are you in a hurry?]
This downloads each page into a file called $id, and we need to make that a directory:
---- perl code ----
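# run in ~/www/my.repo
use strict;
use warnings;
use File::Slurp;
my $root = '.';
opendir( my $dh, $root ) || die "can't open doc root\n";
# the downloaded abstract pages are plain files whose names start with a digit
my @files = grep { /^\d/ && -f "$root/$_" } readdir($dh);
closedir $dh;
foreach my $file (@files) {
  my $source = "$root/$file";
  my $content = read_file( $source, binmode => ':utf8' );
  # replace the file with a directory of the same name, holding index.html
  unlink $source or die "can't remove $source: $!\n";
  mkdir $source or die "can't create $source: $!\n";
  write_file( "$source/index.html", { binmode => ':utf8' }, $content );
}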
---- perl code ----
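Since the download loop runs for a long time, it may be worth checking which ids never arrived before moving on. The following is only a sketch: `missing_ids` is a hypothetical helper, not part of wget or EPrints, and it works whether or not the files have already been turned into directories. Some gaps are probably expected, since an eprintid that has no visible page will not have downloaded anything.

```shell
#!/bin/sh
# missing_ids MAX [DIR]: print every id in 1..MAX that has neither a file
# nor a directory of that name in DIR (default: the current directory).
# Hypothetical helper for illustration only.
missing_ids() {
    max=$1
    dir=${2:-.}
    id=1
    while [ "$id" -le "$max" ]; do
        # -e is true for a plain file (fresh download) and for a directory
        # (after the conversion script has run)
        [ -e "$dir/$id" ] || printf '%s\n' "$id"
        id=$((id + 1))
    done
}

# Small self-contained demonstration in a throwaway directory
demo=$(mktemp -d)
touch "$demo/1"          # still a downloaded file
mkdir "$demo/3"          # already converted to a directory
missing_ids 4 "$demo"    # prints 2 and 4
rm -rf "$demo"
```

In ~/www/my.repo you would call it as `missing_ids 12345`, using the same upper bound as the download loop.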
Grab a copy of all the documents
The problem here is that EPrints has a funny directory structure, which doesn't map to URLs, so we grab a copy, and then move them into the right place. (Still in ~/www/my.repo/)
scp -r eprints_user@my.repo:/path/to/eprints/archives/opendepot/documents/disk0 .
... will copy disk0 and all its subdirectories into the current directory.
Copy the documents into the right location in the web site
We then use a move script to move each of the documents into the appropriate abstract directory:
---- perl code ----
# you will need to tweak the root & number of nested loops if you have more data.
use strict;
use warnings;
use File::Copy::Recursive;
# disk0 was copied into the current directory (~/www/my.repo) by the scp above
my $root = './disk0/00/00';
my $destination = '.';
opendir( my $dh, $root ) || die "can't open doc root\n";
my @tlfs = grep { /^\w/ && -r "$root/$_" } readdir($dh);
closedir $dh;
foreach my $tlf (@tlfs) {
  my $dir = "$root/$tlf";
  opendir( my $dh, $dir ) || die "can't open $dir\n";
  my @blfs = grep { /^\w/ && -r "$dir/$_" } readdir($dh);
  closedir $dh;
  foreach my $blf (@blfs) {
    # "02" . "42" gives "0242"; adding 0 drops the leading zeros, giving 242
    my $combined = $tlf . $blf;
    my $final = $combined + 0;
    my $docs = "$dir/$blf";
    my $target = "$destination/$final";
    print "copy $docs -> $target\n";
    File::Copy::Recursive::rcopy( $docs, $target ) or warn "copy of $docs failed: $!\n";
  }
}
---- perl code ----
... This should copy disk0/00/00/02/42/01/something.pdf to 242/01/something.pdf
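To make the mapping explicit: the storage path is the eprint id, zero-padded and split into pairs of digits, so joining the pairs and dropping the leading zeros gives the number used in the URL. A minimal sketch of that arithmetic follows; `docs_path_to_id` is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# docs_path_to_id PATH: turn a zero-padded storage path such as 00/00/02/42
# into the number used in the URL. Hypothetical helper for illustration.
docs_path_to_id() {
    # join the two-digit pairs: 00/00/02/42 -> 00000242
    digits=$(printf '%s' "$1" | tr -d '/')
    # drop the leading zeros: 00000242 -> 242
    id=$(printf '%s' "$digits" | sed 's/^0*//')
    # an all-zero path would leave an empty string, so fall back to 0
    printf '%s\n' "${id:-0}"
}

docs_path_to_id 00/00/02/42    # prints 242
docs_path_to_id 00/01/23/45    # prints 12345
```

The second call shows why the root and the number of nested loops need changing for larger repositories: ids of 10000 and above no longer live under 00/00.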
You can now remove the whole disk0 tree
rm -rf disk0
Tidy up leading 0's
In the abstract pages, the URLs are 242/1/something.pdf, so we need to delete all the leading 0's:
---- perl code ----
use strict;
use warnings;
my $root = '.';
opendir( my $dh, $root ) || die "can't open doc root\n";
my @tlfs = grep { /^\d/ && -d "$root/$_" } readdir($dh);
closedir $dh;
foreach my $tlf (@tlfs) {
  my $dir = "$root/$tlf";
  opendir( my $dh, $dir ) || die "can't open $dir\n";
  my @blfs = grep { /^\d/ && -r "$dir/$_" } readdir($dh);
  closedir $dh;
  foreach my $blf (@blfs) {
    my $old = "$dir/$blf";
    my $fn = $blf + 0;       # "01" becomes 1
    my $new = "$dir/$fn";
    next if $old eq $new;    # no leading zero to strip
    rename $old, $new or warn "can't rename $old: $!\n";
  }
}
---- perl code ----
Deal with all those Absolute URLs
In a significant number of pages, hrefs are absolute, so need to be made relative. This snippet will fix that:
find . -type f -exec sed -i 's_http://my.repo/_/_g' {} +
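To see what the substitution does before letting it loose on the whole tree, it can be run on a single line. The anchor below is made-up sample markup, not copied from a real page. Note that the result is a root-relative link (it starts with /), which works as long as the copy is served from the document root.

```shell
#!/bin/sh
# Sample anchor as it might appear in a mirrored page (hypothetical markup)
sample='<a href="http://my.repo/242/1/something.pdf">PDF</a>'

# Same substitution as the find/sed command, applied to one line on stdin
printf '%s\n' "$sample" | sed 's_http://my.repo/_/_g'
# prints: <a href="/242/1/something.pdf">PDF</a>
```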
Missing stylesheet & images
These will be under archives/<ARCHIVEID>/html/en/style on your EPrints server. Copy the whole style directory across to your local style directory.
Unwanted links to dynamic content
Things like searching and RSS feeds need to be removed from all web pages. Try:
find . -type f -exec sed -i '/rel="alternate"\|rel="Search"\|search\/simple\|ep_search_feed/d' {} +
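Because sed -i edits files in place, a dry run that only lists the affected files may be reassuring. This is a sketch under assumptions: `dynamic_link_files` is a hypothetical helper, the sample markup is invented, and the pattern is the same alternation rewritten for grep -E.

```shell
#!/bin/sh
# dynamic_link_files DIR: list files under DIR with at least one line that
# the sed command would delete. Hypothetical helper; it changes nothing.
dynamic_link_files() {
    grep -rlE 'rel="alternate"|rel="Search"|search/simple|ep_search_feed' "$1"
}

# Small self-contained demonstration in a throwaway directory
demo=$(mktemp -d)
printf '%s\n' '<link rel="alternate" href="/some/feed"/>' > "$demo/a.html"
printf '%s\n' '<p>Nothing dynamic here</p>' > "$demo/b.html"
dynamic_link_files "$demo"    # lists only a.html
rm -rf "$demo"
```

Running `dynamic_link_files .` in ~/www/my.repo shows which files the sed command would touch.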
Some of the abstract pages will have the Preview link - that needs to go:
find . -name '*.html' -exec sed -r -i 's/(\s\|\s)?<a[^>]+>Preview<\/a>//' {} +
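As a quick check of that expression, here it is applied to one line. The fragment is hypothetical markup (a download link, a separator and a Preview link); real pages may differ, so treat this as a sketch rather than a guarantee. It assumes GNU sed, as the -r flag and \s already do.

```shell
#!/bin/sh
# Hypothetical fragment of an abstract page: a download link, a separator,
# and the Preview link that should disappear
sample='<a href="/242/1/something.pdf">Download</a> | <a href="/242/1/something.pdf" class="ep_preview">Preview</a>'

# Same expression as the find/sed command, applied to one line on stdin
printf '%s\n' "$sample" | sed -r 's/(\s\|\s)?<a[^>]+>Preview<\/a>//'
# prints: <a href="/242/1/something.pdf">Download</a>
```

The optional group is what removes the " | " separator along with the link.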
The abstract pages will also list the local Persistent URI - which goes to a URL that's not supported - that also needs to go:
find . -name '*.html' -exec perl -p -i -e 's/URI:<\/th> .*?<\/td><\/tr>//' {} +
You may have export options for some of your browse pages - they won't work:
find view/ -name '*.html' -exec perl -p -i -e '$/=undef;s/<form\b[^>]*>.*?<\/form>//s' {} +
No point in having anything to suggest logins may work - so get rid of all the "Actions" sections:
find . -name '*.html' -exec perl -p -i -e 's/<h3>Actions \(login required\)<\/h3>.*?<\/table>//' {} +
Get the web server to serve pages
.... and now you can point a web server's "document root" at /path/to/my.repo and it should all "just work"
(you may have file-permissions to deal with - but that's not difficult)
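One possible way to deal with the permissions, assuming the web server runs as a different user and only needs read access: make every directory 755 and every file 644. The helper name `make_world_readable` is hypothetical, and your server may want something stricter.

```shell
#!/bin/sh
# make_world_readable ROOT: directories become 755 (listable by anyone),
# files become 644 (readable by anyone). Hypothetical helper.
make_world_readable() {
    find "$1" -type d -exec chmod 755 {} +
    find "$1" -type f -exec chmod 644 {} +
}

# Small self-contained demonstration in a throwaway directory
demo=$(mktemp -d)
mkdir "$demo/242"
touch "$demo/242/index.html"
chmod 700 "$demo/242"
chmod 600 "$demo/242/index.html"
make_world_readable "$demo"
ls -l "$demo/242/index.html" | cut -c1-10    # prints -rw-r--r--
rm -rf "$demo"
```

For the real copy this would be `make_world_readable ~/www/my.repo`.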