Making a fossilised repository

[[Category:Howto]]

(This work requires access to the database & original file-system)

The rough plan is to pull all of the visible pages in the repository and make them available as a static copy - no dynamic stuff, no search, no logins.

This process assumes you have access to the underlying database, and the file-store for the repo.

Assume:

# you're working as a non-root user on a workstation
# your repo is accessible [to you] at http://my.repo/
# your web server will have a document-root of <code>~/www/my.repo</code>
# you know how to set a document-root on your web server
# you'll have to do some tidying up (finding images, etc.) yourself

== Grab a copy of the html pages ==

  mkdir ~/www
  cd ~/www
  wget --local-encoding=UTF-8 --remote-encoding=UTF-8 --no-cache --mirror -nc -k http://my.repo/

== Grab a copy of all the abstract pages ==

You need to check the database - the last eprintid in the eprints table is the number you need to count up to.
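
If you can reach MySQL directly, a query along these lines will give you that number (the database name and credentials below are placeholders, and in a stock EPrints 3 install the table is called <code>eprint</code> - check your own schema):

  mysql -u eprints_user -p my_repo_db -e 'SELECT MAX(eprintid) FROM eprint;'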

  cd my.repo
  for id in {1..12345} ; do wget --local-encoding=UTF-8 --remote-encoding=UTF-8 --no-cache -k http://my.repo/$id ; done

[(a) note the lack of the <code>--mirror</code> option, and (b) this *will* take quite a while... but are you in a hurry?]

This downloads each page into a file called $id, and we need to make that a directory:

<pre>
---- perl code ----
# run in ~/www/my.repo
use strict;
use warnings;
use File::Slurp;

my $root = '.';

# find the plain files whose names start with a digit (the downloaded abstract pages)
opendir( my $dh, $root ) || die "can't open doc root\n";
my @files = grep { /^\d/ && -f "$root/$_" } readdir($dh);
closedir $dh;

# replace each file with a directory of the same name containing index.html
foreach my $file (@files) {
    my $source  = "$root/$file";
    my $content = read_file( $source, binmode => ':utf8' );
    unlink $source;
    mkdir $source;
    write_file( "$source/index.html", { binmode => ':utf8' }, $content );
}
---- perl code ----
</pre>
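
Note that <code>File::Slurp</code> (used above) and <code>File::Copy::Recursive</code> (used in the move script further down) are not core Perl modules - if you use <code>cpanm</code>, something like this will install them:

  cpanm File::Slurp File::Copy::Recursive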

== Grab a copy of all the documents ==

The problem here is that EPrints has a funny directory structure, which doesn't map to URLs, so we grab a copy, and then move them into the right place. (Still in ~/www/my.repo/)

   scp -r eprints_user@my.repo:/path/to/eprints/archives/opendepot/documents/disk0 .

... will copy <code>disk0</code> and all its subdirectories into the current directory.

=== Copy the documents into the right location in the web site ===

We then use a move script to move each of the documents into the appropriate abstract directory:

<pre>
---- perl code ----
# run in ~/www/my.repo
# you will need to tweak the root & number of nested loops if you have more data.
use strict;
use warnings;
use File::Copy::Recursive;

my $root        = './disk0/00/00';   # where the scp'd document tree starts
my $destination = '.';               # the abstract directories created earlier

opendir( my $dh, $root ) || die "can't open doc root\n";
my @tlfs = grep { /^\w/ && -r "$root/$_" } readdir($dh);
closedir $dh;

foreach my $tlf (@tlfs) {
    my $dir = "$root/$tlf";
    opendir( my $dh, $dir ) || die "can't open $dir\n";
    my @blfs = grep { /^\w/ && -r "$dir/$_" } readdir($dh);
    closedir $dh;
    foreach my $blf (@blfs) {
        # e.g. "02" . "42" becomes "0242", which numifies to eprint id 242
        my $combined = $tlf . $blf;
        my $final    = $combined + 0;
        my $docs     = "$dir/$blf";
        my $target   = "$destination/$final";
        print "move $docs -> $target\n";
        File::Copy::Recursive::rcopy( $docs, $target );
    }
}
---- perl code ----
</pre>

... This should copy <code>disk0/00/00/02/42/01/something.pdf</code> to <code>242/01/something.pdf</code>
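
Before deleting anything, it's worth spot-checking one record - for example, re-using the hypothetical eprint 242 from above:

   ls 242/01/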

You can now remove the whole <code>disk0</code> tree:

   rm -rf disk0

=== Tidy up leading 0's ===

In the abstract pages, the URLs are <code>242/1/something.pdf</code>, so we need to delete all the leading 0's:

<pre>
---- perl code ----
# run in ~/www/my.repo
use strict;
use warnings;

my $root = '.';

# the top-level numeric directories are the eprint abstract directories
opendir( my $dh, $root ) || die "can't open doc root\n";
my @tlfs = grep { /^\d/ && -d "$root/$_" } readdir($dh);
closedir $dh;

foreach my $tlf (@tlfs) {
    my $dir = "$root/$tlf";
    opendir( my $dh, $dir ) || die "can't open $dir\n";
    my @blfs = grep { /^\d/ && -r "$dir/$_" } readdir($dh);
    closedir $dh;
    foreach my $blf (@blfs) {
        # rename e.g. 242/01 to 242/1
        my $old = "$dir/$blf";
        my $fn  = $blf + 0;
        my $new = "$dir/$fn";
        rename $old, $new;
    }
}
---- perl code ----
</pre>

== Deal with all those Absolute URLs ==

In a significant number of pages, <code>href</code>s are ''absolute'', so they need to be made relative. This snippet will fix that:

  find . -type f -exec sed -i 's_http://my.repo/_/_g' {} +
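
A quick check that the rewrite caught everything (assuming GNU grep) is to look for any remaining references to the old hostname:

  grep -rl 'http://my.repo/' . | head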

== Missing stylesheet & images ==

These will be under <code>archives/<ARCHIVEID>/html/en/style</code> on your EPrints server - copy the whole <code>style</code> directory across to your local <code>style</code> directory.
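
For example, re-using the hypothetical host, user and archive id from the <code>scp</code> command earlier (adjust them to match your install):

   scp -r eprints_user@my.repo:/path/to/eprints/archives/opendepot/html/en/style ~/www/my.repo/style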

== Unwanted links to dynamic content ==

Things like searching and RSS feeds need to be removed from all web pages. Try:

   find . -type f -exec sed -i '/rel="alternate"\|rel="Search"\|search\/simple\|ep_search_feed/d' {} +

Some of the abstract pages will have the '''Preview''' link - that needs to go:

   find . -name '*.html' -exec sed -r -i 's/(\s\|\s)?<a[^>]+>Preview<\/a>//' {} +

The abstract pages will also list the local Persistent URI - which points to a URL that isn't supported in a static copy - so that also needs to go:

    find . -name '*.html' -exec perl -p -i -e 's/<tr><th align="right">URI:<\/th> <td valign="top">.*?<\/td><\/tr>//' {} +

You may have '''export''' options for some of your browse pages - they won't work:

   find view/ -name '*.html' -exec perl -p -i -e '$/=undef;s/<form\b[^>]*>.*?<\/form>//s' {} +

There's no point in having anything that suggests logins may work - so get rid of all the "Actions" sections:

    find . -name '*.html' -exec perl -p -i -e 's/<h3>Actions \(login required\)<\/h3>.*?<\/table>//' {} +

(that should be most of it...)

== Get the web server to serve pages ==

... and now you can point a web server's "document root" at <code>/path/to/my.repo</code> and it should all "just work". (You may have file permissions to deal with - but that's not difficult.)
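
If the permissions do need fixing, something along these lines (assuming the static copy lives under <code>~/www/my.repo</code>) is usually enough:

   find ~/www/my.repo -type d -exec chmod 755 {} +
   find ~/www/my.repo -type f -exec chmod 644 {} +

And for a quick sanity check before touching the real web server config, you can serve the directory locally with Python's built-in server and browse to http://localhost:8000/ :

   cd ~/www/my.repo && python3 -m http.server 8000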