Making a fossilised repository

(This work requires access to the database & original file-system)

The rough plan is to pull all of the visible pages in the repository and make them available as a static copy - no dynamic content, no search.

This process assumes you have access to the underlying database, and the file-store for the repo.

Assume:

# you're working as a non-root user on a workstation
# your repo is accessible [to you] at http://my.repo/
# your web server will have a document-root of <code>~/www/my.repo</code>
# you know how to set a document-root on your web server
# you'll have to do some tidying up (finding images, etc.) yourself

== Grab a copy of the html pages ==

  mkdir ~/www
  cd ~/www
  wget --local-encoding=UTF-8 --remote-encoding=UTF-8 --no-cache --mirror -k http://my.repo/

== Grab a copy of all the abstract pages ==

You need to check the database - the last <code>eprintid</code> in the eprints table is the number you need to count up to.
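For example, with a MySQL back-end something like this will show the highest eprintid (the database name, user and table name below are only examples - check your archive's configuration for the real ones):

  # db user, db name and table name are examples - adjust to match your archive
  mysql -u eprints -p my_repo_db -e 'SELECT MAX(eprintid) FROM eprint;'

Then fetch each abstract page in turn, replacing 12345 with that number: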

  cd my.repo
  for id in {1..12345} ; do wget --local-encoding=UTF-8 --remote-encoding=UTF-8 --no-cache -k http://my.repo/$id ; done

[(a) note the lack of the <code>--mirror</code> option, and (b) this *will* take quite a while... but are you in a hurry?]

This downloads each page into a file called <code>$id</code>, and we need to make that a directory:

<pre>
---- perl code ----
# run in ~/www/my.repo
use strict;
use warnings;
use File::Slurp;

my $root = '.';

# every plain file whose name starts with a digit is a downloaded abstract page
opendir( my $dh, $root ) || die "can't open doc root\n";
my @files = grep { /^\d/ && -f "$root/$_" } readdir($dh);
closedir $dh;

foreach my $file (@files) {
    my $source = "$root/$file";

    # read the page, replace the file with a directory of the same name,
    # then write the page back as index.html inside it
    my $content = read_file( $source, binmode => ':utf8' );
    unlink $source;
    mkdir $source;
    write_file( "$source/index.html", { binmode => ':utf8' }, $content );
}
---- perl code ----
</pre>
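Saved as (say) <code>make_abstract_dirs.pl</code> - the filename is only an example - it is run from the document root; the same applies to the other Perl snippets below:

   cd ~/www/my.repo
   perl make_abstract_dirs.pl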

== Grab a copy of all the documents ==

The problem here is that EPrints has a funny directory structure, which doesn't map to URLs: a document lives under <code>disk0/AA/BB/CC/DD/</code>, where <code>AABBCCDD</code> is the eprintid zero-padded to eight digits, while its URL only uses the plain eprintid. So we grab a copy, and then move the files into the right place. (Still in <code>~/www/my.repo/</code>)

   scp -r eprints_user@my.repo:/path/to/eprints/archives/opendepot/documents/disk0 .

... will copy <code>disk0</code> and all its subdirectories into the current directory.

=== Copy the documents into the right location in the web site ===

We then use a move script to move each of the documents into the appropriate abstract directory:

<pre>
---- perl code ----
# run in ~/www/my.repo
# assumes every eprintid fits under disk0/00/00 (i.e. is below 10000) - you will
# need to tweak the root & the number of nested loops if you have more data.
use strict;
use warnings;
use File::Copy::Recursive;

my $root        = './disk0/00/00';
my $destination = '.';

opendir( my $dh, $root ) || die "can't open doc root\n";
my @tlfs = grep { /^\w/ && -r "$root/$_" } readdir($dh);
closedir $dh;

foreach my $tlf (@tlfs) {
    my $dir = "$root/$tlf";
    opendir( my $dh, $dir ) || die "can't open $dir\n";
    my @blfs = grep { /^\w/ && -r "$dir/$_" } readdir($dh);
    closedir $dh;
    foreach my $blf (@blfs) {
        # e.g. "02" . "42" gives "0242", which becomes eprintid 242
        my $combined = $tlf . $blf;
        my $final    = $combined + 0;
        my $docs     = "$dir/$blf";
        my $target   = "$destination/$final";
        print "move $docs -> $target\n";
        File::Copy::Recursive::rcopy( $docs, $target );
    }
}
---- perl code ----
</pre>

... This should copy <code>disk0/00/00/02/42/01/something.pdf</code> to <code>242/01/something.pdf</code>
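Before deleting the original tree it is worth spot-checking one of the copied documents (the id here is just the example above - use one from your own repository):

   ls 242/01/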

You can now remove the whole <code>disk0</code> tree

   rm -rf disk0

=== Tidy up leading 0's ===

In the abstract pages, the URLs are <code>242/1/something.pdf</code>, so we need to delete all the leading 0's:

<pre>
---- perl code ----
# run in ~/www/my.repo
use strict;
use warnings;

my $root = '.';

opendir( my $dh, $root ) || die "can't open doc root\n";
my @tlfs = grep { /^\d/ && -d "$root/$_" } readdir($dh);
closedir $dh;

foreach my $tlf (@tlfs) {
    my $dir = "$root/$tlf";
    opendir( my $dh, $dir ) || die "can't open $dir\n";
    my @blfs = grep { /^\d/ && -r "$dir/$_" } readdir($dh);
    closedir $dh;
    foreach my $blf (@blfs) {
        # e.g. "242/01" becomes "242/1"
        my $old = "$dir/$blf";
        my $fn  = $blf + 0;
        my $new = "$dir/$fn";
        rename $old, $new;
    }
}
---- perl code ----
</pre>

== Deal with all those Absolute URLs ==

In a significant number of pages, <code>href</code>s are ''absolute'', so they need to be made relative. This snippet will fix that:

  find . -type f -exec sed -i 's_http://my.repo/_/_g' {} +
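A quick check that nothing was missed (any file listed here still contains an absolute URL):

  grep -rl 'http://my.repo/' . | head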

== Missing stylesheet & images ==

These will be under <code>archives/<ARCHIVEID>/html/en/style</code> on your EPrints server - copy the whole <code>style</code> directory across to your local <code>style</code> directory.
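For example, reusing the login and paths from the <code>scp</code> step above (the archive id <code>opendepot</code> and both paths are just the earlier examples - adjust to match your repository):

   mkdir -p ~/www/my.repo/style
   scp -r "eprints_user@my.repo:/path/to/eprints/archives/opendepot/html/en/style/*" ~/www/my.repo/style/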

== Unwanted links to dynamic content ==

Things like searching and RSS feeds need to be removed from all web pages. Try:

   find . -type f -exec sed -i '/rel="alternate"\|rel="Search"\|search\/simple\|ep_search_feed/d' {} +

== Get the web server to serve pages ==

.... and now you can point a web server's "document root" at <code>~/www/my.repo</code> and it should all "just work" (you may have file-permissions to deal with - but that's not difficult)
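If you want a quick sanity check before configuring the real web server, any static file server pointed at that directory will do - for example (assuming Python 3 is available on the workstation):

   cd ~/www/my.repo
   python3 -m http.server 8080

then browse to http://localhost:8080/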