Making a fossilised repository
Revision as of 08:37, 20 July 2017
(This work requires access to the database & original file-system)
The rough plan is to pull all of the visible pages in the repository and make them available as a static copy - no dynamic stuff.
This process assumes you have access to the underlying database, and the file-store for the repo.
Assume:
- you're working as a non-root user on a workstation
- your repo is accessible [to you] at http://my.repo/
- your web server will have a document-root of ~/www/my.repo
- you know how to set a document-root on your web server
- you'll have to do some tidying up (finding images, etc) yourself
Grab a copy of the html pages
mkdir ~/www
cd ~/www
wget --local-encoding=UTF-8 --remote-encoding=UTF-8 --no-cache --mirror -nc -k http://my.repo/
Grab a copy of all the abstract pages
You need to check the database - the last eprintid in the eprints table is the number you need to count up to:
cd my.repo
for id in {1..12345} ; do wget --local-encoding=UTF-8 --remote-encoding=UTF-8 --no-cache -k http://my.repo/$id ; done
[(a) note the lack of the --mirror option, and (b) this *will* take quite a while..... but are you in a hurry?]
This downloads each page into a file called $id, and we need to make that a directory:
---- perl code ----
# run in ~/www/my.repo
use strict;
use warnings;
use File::Slurp;

my $root = '.';
opendir( my $dh, $root ) || die "can't open doc root\n";
# only plain files whose names start with a digit (the downloaded abstract pages)
my @files = grep { /^\d/ && -f "$root/$_" } readdir($dh);
closedir $dh;

foreach my $file (@files) {
    my $source  = "$root/$file";
    my $content = read_file( $source, binmode => ':utf8' );
    unlink $source;    # remove the flat file...
    mkdir $source;     # ...and reuse its name for a directory
    write_file( "$source/index.html", { binmode => ':utf8' }, $content );
}
---- perl code ----
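To see which flat page files are still waiting to be converted (before the script) or were missed (after it), a quick listing helps. This is a minimal shell sketch, assuming it is pointed at ~/www/my.repo; `list_flat_pages` is a hypothetical helper name, not part of EPrints or wget:

```shell
# list_flat_pages: hypothetical helper, for illustration only.
# Prints every plain file directly under the given root whose name starts
# with a digit, i.e. abstract pages that have not yet become directories.
list_flat_pages() (
    root=${1:-.}
    for f in "$root"/[0-9]*; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
)
```

Running `list_flat_pages ~/www/my.repo | wc -l` should report 0 once every page has become a directory containing an index.html.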
Grab a copy of all the documents
The problem here is that EPrints has a funny directory structure, which doesn't map to URLs, so we grab a copy, and then move them into the right place. (Still in ~/www/my.repo/)
scp -r eprints_user@my.repo:/path/to/eprints/archives/opendepot/documents/disk0 .
... will copy disk0 and all its subdirectories into the current directory.
Copy the documents into the right location in the web site
We then use a move script to move each of the documents into the appropriate abstract directory:
---- perl code ----
# you will need to tweak the root & number of nested loops if you have more data.
# run in ~/www/my.repo, with disk0 copied into this directory as above
use strict;
use warnings;
use File::Copy::Recursive;

my $root        = './disk0/00/00';
my $destination = '.';

opendir( my $dh, $root ) || die "can't open doc root\n";
my @tlfs = grep { /^\w/ && -r "$root/$_" } readdir($dh);
closedir $dh;

foreach my $tlf (@tlfs) {
    my $dir = "$root/$tlf";
    opendir( my $dh, $dir ) || die "can't open $dir\n";
    my @blfs = grep { /^\w/ && -r "$dir/$_" } readdir($dh);
    closedir $dh;

    foreach my $blf (@blfs) {
        my $combined = $tlf . $blf;      # e.g. '02' . '42' -> '0242'
        my $final    = $combined + 0;    # numeric context drops leading zeros -> 242
        my $docs     = "$dir/$blf";
        my $target   = "$destination/$final";
        print "copy $docs -> $target\n";
        File::Copy::Recursive::rcopy( $docs, $target );
    }
}
---- perl code ----
... This should copy disk0/00/00/02/42/01/something.pdf to 242/01/something.pdf
You can now remove the whole disk0 tree:
rm -rf disk0
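The mapping from storage path to abstract directory is just the path components under disk0 joined together, with the leading zeros dropped. A small shell sketch of that rule, which may help when working out how many nested loops a larger repository needs; `eprint_path_to_id` is a hypothetical name used only for illustration:

```shell
# eprint_path_to_id: hypothetical helper, for illustration only.
# Takes the directory part under disk0 (e.g. "00/00/02/42") and prints
# the id it corresponds to.
eprint_path_to_id() (
    digits=$(printf '%s' "$1" | tr -d '/')    # "00/00/02/42" -> "00000242"
    # strip leading zeros one at a time, keeping a single "0" if that is all there is
    while [ "${#digits}" -gt 1 ] && [ "${digits#0}" != "$digits" ]; do
        digits=${digits#0}
    done
    printf '%s\n' "$digits"
)

eprint_path_to_id "00/00/02/42"    # prints 242
```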
Tidy up leading 0's
In the abstract pages, the URLs are 242/1/something.pdf, so we need to delete all the leading 0's:
---- perl code ----
# run in ~/www/my.repo
use strict;
use warnings;

my $root = '.';
opendir( my $dh, $root ) || die "can't open doc root\n";
my @tlfs = grep { /^\d/ && -d "$root/$_" } readdir($dh);
closedir $dh;

foreach my $tlf (@tlfs) {
    my $dir = "$root/$tlf";
    opendir( my $dh, $dir ) || die "can't open $dir\n";
    my @blfs = grep { /^\d/ && -r "$dir/$_" } readdir($dh);
    closedir $dh;

    foreach my $blf (@blfs) {
        my $old = "$dir/$blf";
        my $fn  = $blf + 0;      # '01' -> 1
        my $new = "$dir/$fn";
        next if $old eq $new;    # no leading zero to strip
        rename $old, $new;
    }
}
---- perl code ----
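Since the renames are done in place, it may be worth previewing them first. The following shell sketch only prints what would be renamed and changes nothing; it assumes the layout described above (numeric abstract directories containing numeric document directories), and `preview_zero_renames` is a hypothetical helper name:

```shell
# preview_zero_renames: hypothetical helper, for illustration only.
# Prints "old -> new" for each document directory whose name has leading zeros.
preview_zero_renames() (
    root=${1:-.}
    for dir in "$root"/[0-9]*/[0-9]*; do
        [ -d "$dir" ] || continue
        name=${dir##*/}    # last path component, e.g. "01"
        new=$name
        while [ "${#new}" -gt 1 ] && [ "${new#0}" != "$new" ]; do
            new=${new#0}
        done
        [ "$new" = "$name" ] || printf '%s -> %s\n' "$dir" "${dir%/*}/$new"
    done
)
```

Run as `preview_zero_renames ~/www/my.repo`; an empty result means there is nothing left to tidy.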
Deal with all those Absolute URLs
In a significant number of pages, hrefs are absolute, so they need to be made relative. This snippet will fix that:
find . -type f -exec sed -i 's_http://my.repo/_/_g' {} +
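To check that the rewrite caught everything, the files that still mention the old base URL can be listed. A minimal shell sketch, assuming a grep that supports recursive search (-r), as GNU and BSD grep do; `list_absolute_urls` is a hypothetical helper name:

```shell
# list_absolute_urls: hypothetical helper, for illustration only.
# Prints the name of every file under the root that still contains the base URL.
list_absolute_urls() (
    root=${1:-.}
    base=${2:-http://my.repo/}
    # -r recurse, -l print file names only, -F treat the URL as a fixed string.
    # grep exits non-zero when nothing matches, which is the good case here.
    grep -rlF -e "$base" "$root" || true
)
```

After the sed command has run, `list_absolute_urls ~/www/my.repo` should print nothing.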
Missing stylesheet & images
These will be under archives/<ARCHIVEID>/html/en/style on your EPrints server - copy the whole style directory across to your local style directory.
Unwanted links to dynamic content
Links to things like searching and RSS feeds need to be removed from all web pages. Try:
find . -type f -exec sed -i '/rel="alternate"\|rel="Search"\|search\/simple\|ep_search_feed/d' {} +
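Note that the sed expression deletes whole lines, so anything else sharing a line with a match goes too. To see what would be removed before committing to it, the same four patterns can be handed to grep. This is a sketch only; `preview_dynamic_lines` is a hypothetical helper name:

```shell
# preview_dynamic_lines: hypothetical helper, for illustration only.
# Prints, with line numbers, each line the sed command above would delete
# from the given file(s). Nothing is modified.
preview_dynamic_lines() {
    grep -n -e 'rel="alternate"' -e 'rel="Search"' \
            -e 'search/simple' -e 'ep_search_feed' "$@" || true
}
```

For example, `preview_dynamic_lines index.html` lists the matching lines of the front page.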
Get the web server to serve pages
.... and now you can point a web server's "document root" at /path/to/my.repo and it should all "just work" (you may have file-permissions to deal with - but that's not difficult)
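If permissions do get in the way, one common fix is to make the tree world-readable. The sketch below assumes the web server runs as a different user that needs read access to everything under the document root; the modes 755 and 644 are a conventional choice rather than a requirement, and `make_world_readable` is a hypothetical helper name:

```shell
# make_world_readable: hypothetical helper, for illustration only.
# Directories get 755 (listable and traversable), files get 644 (readable).
make_world_readable() (
    root=$1
    [ -d "$root" ] || { echo "not a directory: $root" >&2; exit 1; }
    find "$root" -type d -exec chmod 755 {} +
    find "$root" -type f -exec chmod 644 {} +
)
```

Call it with the document root, for example `make_world_readable ~/www/my.repo`.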