Entire Manual

Warning! This page is under development as part of the EPrints 3.0 manual. It may still contain content specific to earlier versions. Manuals for previous versions of EPrints are also available.

This page was generated on 2008-08-22

Introduction

What is EPrints?

EPrints 3.0 is generic repository building software developed by the University of Southampton. It is intended to create a highly configurable web-based repository.

EPrints is often used as an open archive for research papers, and the default configuration reflects this, but it is also used for other things such as images, research data, audio archives - anything that can be stored digitally.

The EPrints series began in early 2000 and is in use by over 200 sites!

Should I be installing EPrints 3, and how much effort will it take?

Start by looking at http://demoprints3.eprints.org/ to get a feel for what the software does.

You can get a vanilla install up and running quite easily; the installation notes on the wiki should help you over any snags relating to your operating system. You'll need a UNIX-like machine (Linux is good), and a root password is helpful.

The task which will take longest is actually deciding what you want your repository to do (and not do). Many sites want to make significant customisations. EPrints creates a repository with a sensible default, but all our users want something slightly different.

Installing and configuring the software isn't too hard, and we're working on admin tools to make it even easier.

The time taken to run the archive day to day depends on your own policy. Do you want a very light touch on the data submitted, or a formal review process for each item? That's up to you!

What will it run on?

We develop EPrints on Red Hat Linux (both Fedora Core and Enterprise), but it is used on many other Linux distributions and on other UNIX-like systems, including OS X. Thanks to support from Microsoft, it also runs on Windows Vista and XP.

EPrints doesn't require any unusual hardware. It's slightly easier to run on a dedicated machine, but that's not essential, and should not affect performance.

Don't forget to budget for a backup system; your data is valuable!

Required Software

What Additional Software does EPrints Require?

In brief, EPrints requires Apache (with mod_perl), MySQL and Perl with some extra modules. Ideally you also want wget, tar and unzip.

EPrints bundles some of the Perl modules it uses, to save you installing them.

Where to get the Required Software

Almost all of the required software can be obtained through the yum (Fedora Core) or up2date (Red Hat Enterprise Linux) software management tools.

Fedora Core 5 also has a Package Manager tool under the Applications->Add/Remove Software menu.

Apache, MySQL, Perl and mod_perl can be installed during the installation of Fedora Core/RHEL (see Recommended Platforms).

Apache

FC% yum install httpd

To make Apache start automatically when the machine is rebooted:

root% /sbin/chkconfig httpd on

PLEASE NOTE: EPrints 3.0 only supports Apache version 2, not version 1.3!

MySQL

FC% yum install mysql mysql-server

To make MySQL start automatically when the machine is rebooted:

root% /sbin/chkconfig mysqld on

Perl

FC% yum install perl

mod_perl

FC% yum install mod_perl

GDOME

FC% yum install gdome2 gdome2-devel

Additional Perl Modules

The majority of Perl modules needed by EPrints are already installed on Fedora Core/RHEL.

Install Unicode::String:

% yum install perl-Unicode-String

Install XML::GDOME from source:

% wget http://cpan.uwinnipeg.ca/cpan/authors/id/T/TJ/TJMATHER/XML-GDOME-0.86.tar.gz
% tar xzvf XML-GDOME-0.86.tar.gz
% cd XML-GDOME-0.86/
% perl Makefile.PL
% make
% make install

Note: Fedora Core 5 needs 2 extra perl modules for XML::GDOME:

% yum install perl-XML-SAX perl-XML-LibXML-Common

Other Tools

File uploads

wget, tar, gunzip and unzip are required to allow users to upload files as .tar.gz or .zip, or to capture them from a URL.

These all come installed with most modern versions of Linux. If you can't get them working, you can remove the option by editing "archive_formats" in SystemSettings.pm.

Tested with wget 1.6.

If there are problems you may need to tweak how these are invoked in SystemSettings.pm

Full Text Indexing

The EPrints full text indexer requires various tools to extract plain (UTF-8) text from each kind of document. These tools may or may not already be installed on your system. EPrints uses them to build a "words" file for each document, which contains the text of the document in UTF-8. If it can't run the tool, the "words" file will be empty, and EPrints will not retry creating it unless you manually remove it.

PDF

Requires pdftotext which is part of the xpdf package.

FC% yum install xpdf

Microsoft Word

Requires wvText which is part of the wvWare package.

FC% yum install wv

Requires antiword, available from http://www.winfield.demon.nl/

antiword RPM packages for Red Hat, CentOS and Fedora

HTML

Requires the lynx tool (a text based browser)

FC% yum install lynx

LaTeX Tools

There is an optional feature which allows you to instruct EPrints to look in certain fields (e.g. title and abstract) for strings that look like LaTeX equations and render them as images. These tools are only required if you want to use this feature.

latex and dvips should already be available on Fedora Core and RHEL; if not:

FC% yum install tetex-latex

convert (part of the ImageMagick package) should already be available on Fedora Core and RHEL; if not:

FC% yum install ImageMagick

This is a "cosmetic" feature: it only affects the rendering of information, so you can always add it later if you want to save time initially.

Other Platforms

The best place to get a software tool is the official site, but we've put a mirror of versions known to work at: http://www.eprints.org/files/tools/ - you don't need to install everything in the tools directory - just those described below.

Installing MySQL

Install a recent version of MySQL 3. You will need the .h and library files later to install the MySQL perl module. MySQL 4 is due soon, but we are not making plans to support it yet (if you try EPrints with MySQL 4 and it works, please let us know)

If installing from RPM you require: mysql-server, mysql-devel and mysql RPMs.

Compatibility

EPrints 2.3 was tested with: 3.23.29a-gamma

Installing mod_perl

Apache is the most commonly used webserver in the world, and it's free! EPrints requires Apache to be configured with mod_perl, as this allows Apache modules that are entirely written in perl, hence providing much improved efficiency.

Get Apache from http://httpd.apache.org/dist/httpd/

EPrints requires that the apache module mod_perl is enabled.

Apache with mod_perl Installation - Step by Step

  • Download mod_perl and apache sources
  • Make mod_perl, I use this command (in the modperl src dir):

% perl Makefile.PL APACHE_PREFIX=/usr/local/apache \
APACHE_SRC=../apache-1.3.14/src DO_HTTPD=1 USE_APACI=1 \
EVERYTHING=1

Remember to change ../apache-1.3.14/src to wherever your Apache source is relative to this directory. The backslashes at the end of each line allow a single command to be split over multiple lines.

  • Make and install apache. From the mod_perl src dir, I use:

% make
% make install

( mod perl should have already run the apache ./configure script for us. )

Compatibility notes

EPrints 2.3 Tested with: apache 1.3.14 with mod_perl 1.25

Installing Perl Modules

EPrints is currently being developed with Perl 5.6.1. There are currently no plans to make EPrints run under Perl 6, on the theory of if-it-ain't-broke-don't-fix-it.

Some perl modules are bundled with the EPrints2 package, others must be installed by you.

Installing a Perl Module

This describes how to install a simple Perl module; some require a bit more effort. We will use the non-existent FOO module as an example.

Some modules can be installed directly from CPAN. That's great when it works. It doesn't always work, but it's the quickest and easiest way, so give it a go first. To install a Perl module from CPAN run:


% perl -MCPAN -e 'install Foo::Bar'

Where Foo::Bar is the module you're installing.

I would like to make a list of which modules do/don't install OK from CPAN. If you're reading this before the end of Jan 2003, send me (Christopher Gutteridge) any comments on which ones worked, and on what operating system.

Download the archive, either from cpan.org or from the tools directory on eprints.org described at the top of this chapter. Our example archive is FOO-5.23.tar.gz.

Unpack the archive:

% gunzip FOO-5.23.tar.gz
% tar xf FOO-5.23.tar

Enter the directory this creates:

% cd FOO-5.23

Run the following commands:

% perl Makefile.PL
% make
% make test
% make install

Perl Modules Bundled with EPrints

You don't have to install these. They are included as part of the EPrints distribution.

XML::DOM, XML::RegExp, Filesys::DiskSpace, URI, Apache::AuthDBI, Unicode::Normalize, Proc::Reliable.

Please note that these modules are not part of the EPrints system and are only included to make things easier. Also note that XML::DOM has had a few lines commented out to prevent it requiring additional modules.

Required Perl Modules (Which you will probably have to install)

These modules are not built into EPrints; you must install them yourself. We recommend installing them in the order they are listed.

Data::ShowTable 
MySQL Interface Module requires this.
DBI 
Tested with: v1.14

MySQL Interface Module requires this.

Msql-Mysql Module 
Tested with: v1.2215

This one can be tricky. It requires access to .h and library files from MySQL. I install MySQL from source first, but some installs of MySQL don't put the lib and include dirs where this module expects. The answer to the first question is that you only need MySQL support. Under Red Hat's GNU/Linux distribution, the zlib-devel RPM should be installed before you install this module.

MIME::Base64 
Tested with: v2.11

Unicode::String requires this.

Unicode::String 
Used for Unicode support. No known problems. Tested with v2.06.
XML::Parser 
Tested with v2.30

Used to parse XML files. Requires the expat library. A .tar.gz and an RPM are available in the tools dir on eprints.org.

Apache 
The Perl Apache.pm module is actually part of mod_perl; installing mod_perl as part of Apache should also have installed the Perl Apache module.

Since version 2.3.7 the modules "Apache::Request" and "Apache::Test" (aka "libapreq") are no longer required. They were a pain to install and the software has been redesigned to not use them at all.

Required Perl Modules (Which you will probably already have)

Most Perl 5.6 or later systems should already include the following modules, but you may have to install some by hand on certain platforms.

CGI, Carp, Cwd, Data::Dumper, Digest::MD5, File::Basename, File::Copy, File::Find, File::Path, Getopt::Long, Pod::Usage, Sys::Hostname.

Installing GDOME

Since EPrints 2.2 you may use either XML::DOM or XML::GDOME. XML::GDOME is recommended as it's faster and uses much less RAM, but it does require you to install a whole lot of extra libraries and perl modules. If you are running a pilot or demonstration service then XML::DOM is fine, and you can always switch over later by installing the required tools and setting the GDOME flag in perl_lib/EPrints/SystemSettings.pm

Additional Libraries Required for GDOME support

libxml2
libxml2-devel

either get the tarball from: ftp://ftp.gnome.org/pub/GNOME/sources/libxml2/

or the RPMs (but we have had problems with complex RPM dependencies):


http://rpmfind.net/linux/rpm2html/search.php?query=libxml2
http://rpmfind.net/linux/rpm2html/search.php?query=libxml2-devel

The GDOME Library

Obtain this from


http://gdome2.cs.unibo.it/#downloads

You may either use the RPMs (gdome2 and gdome2-devel) or the tarball.

Additional Perl Modules Required for GDOME support

XML-LibXML-Common
XML-NamespaceSupport
XML-GDOME

All of which are in http://www.cpan.org/modules/by-module/XML/

See Also

Installation

Installation

(If you are upgrading an existing installation of eprints please see the section on upgrading elsewhere in this manual.)

EPrints must be installed as the same user that the Apache web server runs as. We suggest you install it as user "eprints" and group "eprints". On some UNIX platforms you can create a user and group with the "adduser" command; otherwise refer to your operating system documentation.

Unpack the eprints tar.gz file:

% gunzip eprints-3.something.tar.gz
% tar xf eprints-3.something.tar

Now run the "configure" script. This is a /bin/sh script which will attempt to locate various parts of your system such as the perl binary. It will also check your system for required components.

% cd eprints-3.something
% ./configure

By default the system installs as user and group "eprints". You will need to change this if you are not installing as either "root" or "eprints".

The configure script accepts a number of options.

--help 
List all the options (many are intended for compiled software and are ignored).

Recommended:

--prefix=PREFIX 
Where to install EPrints (or look for a version to upgrade). By default /opt/eprints3/
--with-smtp-server=[HOST] 
Use HOST to deliver mail. If the server running EPrints has an MTA such as exim or sendmail, you can specify localhost. If you do not specify this option, you will get a warning to configure it later.
--with-user=[USER] 
Install eprints to run as USER. By default "eprints".
--with-group=[GROUP] 
Install eprints to run as GROUP. By default "eprints".

Optional:

--with-perl=[PATH] 
Path of perl interpreter (in case configure can't find it, or you have more than one and want to use a specific one).
--with-virtualhost=[VIRTUALHOST] 
Use VIRTUALHOST rather than * for apache VirtualHost directives.
--with-toolpath=[PATH] 
An alternate path to search for the required binaries.
--disable-diskfree 
Disable disk free space calls. These can cause problems on some platforms, notably 64-bit.

Deprecated:

--with-apache=1 
Use Apache 1.x.x instead of 2.x.x, but EPrints 3 does not support this.

Once you are happy with your configuration you may install eprints by running install.pl:

% ./install.pl

Now you should edit the configuration file for your copy of Apache. This is often /usr/local/apache/conf/httpd.conf or /etc/httpd/conf/httpd.conf

Add this line: (If you didn't install eprints in /opt/eprints3/ replace that with the location on your system).

Include /opt/eprints3/cfg/apache.conf

Note that this file is only available after you have created your archive via epadmin create. See Running epadmin for more information on creating an archive.

You may also wish to change the user and group apache runs as. The user must be the same as the user you installed eprints as. We recommend:

User eprints
Group eprints

Structure

Terms

This is a definition of some terms used in the eprints documentation and comments. Many of these are "objects" within the code, and the Perl module which handles them is listed.

repository

EPrints::Repository

A repository (called an "archive" in older versions and in parts of this manual) is an EPrints archive with its own website, configuration and data. One install of the EPrints software can run several separate repositories, sharing code but with totally different configurations. Before EPrints 2.4 this was known as EPrints::Archive. It was renamed to avoid confusion with the eprint status of "archive".

session

EPrints::Session

A session is created every time a cgi script or a bin script is executed, and terminated afterwards.

eprint

EPrints::DataObj::EPrint

An eprint is a record in the system which has one or more documents and some metadata. Usually, multiple documents provide the same information in different formats, although this is not compulsory. Pre 2.4 this was known as EPrints::EPrint.

document

EPrints::DataObj::Document

A document is a single format of an eprint, eg. HTML, PDF, PS etc. It can contain more than one file, for example HTML may contain more than one html page + image files. The actual files are stored in the filesystem. Pre 2.4 this was known as EPrints::Document.

user

EPrints::DataObj::User

A user registered with the system (NOT necessarily the author of the eprints they deposit). Pre 2.4 this was known as EPrints::User.

subject

EPrints::DataObj::Subject

A subject has an id and a list of its parents. There is a built-in subject with the id "ROOT" to act as the top level. A subject can have more than one parent to allow you to create a rich lattice, rather than just a tree, but loops are not allowed.
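The "lattice, but no loops" rule can be sketched as a small check. This is an illustration in Python only (EPrints itself is written in Perl); the dictionary layout, the example subject ids other than "ROOT", and the function name find_loop are assumptions made for this sketch, not part of EPrints.

```python
def find_loop(parents):
    """Return a list of subject ids forming a loop, or None if there is none.

    `parents` maps each subject id to the list of its parent ids.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(subject, path):
        state[subject] = IN_PROGRESS
        path.append(subject)
        for parent in parents.get(subject, []):
            status = state.get(parent, UNSEEN)
            if status == IN_PROGRESS:
                # We walked back onto the current path: that is a loop.
                return path[path.index(parent):] + [parent]
            if status == UNSEEN:
                loop = visit(parent, path)
                if loop:
                    return loop
        path.pop()
        state[subject] = DONE
        return None

    for subject in parents:
        if state.get(subject, UNSEEN) == UNSEEN:
            loop = visit(subject, [])
            if loop:
                return loop
    return None


# Allowed: "biophysics" has two parents, so this is a lattice, not a tree.
lattice = {
    "ROOT": [],
    "physics": ["ROOT"],
    "biology": ["ROOT"],
    "biophysics": ["physics", "biology"],
}

# Not allowed: "a" is its own ancestor.
looped = {"ROOT": [], "a": ["b"], "b": ["a"]}

print(find_loop(lattice))
print(find_loop(looped))
```

A check like this would reject a subject file in which any subject is, directly or indirectly, its own ancestor, while still accepting subjects with several parents.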

type or usertype or eprinttype 
Users, eprints and documents all have a "type". This controls how they are "cited" and, for users and eprints, also controls which fields may be edited and which are required.

dataobj (aka "item")

EPrints::DataObj

The "super class" of subjects, users, eprints and documents etc. In the very core of the system these are all treated identically and much of the configuration and methods of these classes of "thing" are identical. We use the term item to speak about the general case.
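As a rough picture of how these terms relate, the sketch below models a few kinds of item in Python. EPrints itself is written in Perl; the class layout, the field names and the describe helper are assumptions made for this illustration and are not the EPrints API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataObj:
    """Common base: every item has an id, whatever kind of thing it is."""
    id: str


@dataclass
class Document(DataObj):
    """A single format of an eprint; it may consist of several files."""
    format: str = ""
    files: List[str] = field(default_factory=list)


@dataclass
class EPrint(DataObj):
    """A record with some metadata and one or more documents."""
    title: str = ""
    documents: List[Document] = field(default_factory=list)


@dataclass
class User(DataObj):
    """A user registered with the system."""
    name: str = ""


def describe(item: DataObj) -> str:
    """Works on any item, because it only relies on the shared base class."""
    return "%s %s" % (type(item).__name__.lower(), item.id)


eprint = EPrint(
    id="1",
    title="An example paper",
    documents=[
        Document(id="1-1", format="pdf", files=["paper.pdf"]),
        Document(id="1-2", format="html", files=["index.html", "figure.png"]),
    ],
)

print(describe(eprint))
print(describe(User(id="7", name="A. Depositor")))
```

The point of the shared base class is that general-purpose code, like describe here, can treat every kind of item the same way.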

dataset

EPrints::DataSet

A dataset is a collection of items of the same type. It can be searched. Some datasets share the same "config id", which is used to get information about the dataset from the archive config: inbox, buffer, archive and deletion all have the same metadata fields and types. Core datasets are:

DATASET ID     COMMENT
eprint         EPrint records are the core of the system.
user           Users registered with the system.
subject        The subject tree.
document       Documents belonging to EPrints. Every document is part of an EPrint record.
subscription   Subscriptions made by users. Every subscription is a part of a Subscription record.
history        Stores actions performed on records.

In addition to these datasets there are four virtual datasets: inbox, buffer, archive and deletion. These act just like "eprint" except that they are filtered to only contain records with that status.
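The idea of a virtual dataset as a filtered view can be sketched as follows. This is a Python illustration only; the record layout and the field names "eprintid" and "eprint_status" are assumptions for the example, not taken from the EPrints code.

```python
VIRTUAL_DATASETS = ("inbox", "buffer", "archive", "deletion")

# Hypothetical records from the "eprint" dataset.
eprints = [
    {"eprintid": 1, "eprint_status": "inbox"},
    {"eprintid": 2, "eprint_status": "archive"},
    {"eprintid": 3, "eprint_status": "archive"},
    {"eprintid": 4, "eprint_status": "buffer"},
]


def virtual_dataset(records, dataset_id):
    """Return the records of the "eprint" dataset that have the given status."""
    if dataset_id not in VIRTUAL_DATASETS:
        raise ValueError("not a virtual dataset: %r" % dataset_id)
    return [r for r in records if r["eprint_status"] == dataset_id]


print([r["eprintid"] for r in virtual_dataset(eprints, "archive")])
```

Nothing is stored twice in this picture: each virtual dataset is simply the "eprint" dataset seen through a status filter.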

Note that prior to 2.4 the "eprint" dataset was virtual, rather than "inbox", "buffer" etc. The History dataset was introduced in 2.4.

database

EPrints::Database

The connection to the MySQL back end. Datasets are stored in the MySQL system, but you do not have to address it directly.

fields (or "metadata fields")

EPrints::MetaField

A single field in a dataset. Each dataset has a few "system" fields which eprints uses to manage the system and then any number of archive specific fields which you may configure.

subscriptions

EPrints::Subscription

Some software refers to this concept as "alerts".

A stored search which is performed every day/week/month; any new results are then mailed to the user who owns the subscription.

This diagram does not show "Subscription". Subscription is a subclass of DataObj (like EPrint, User etc.). A Subscription is associated with one User. A User is associated with 0..n Subscriptions.

EPrints Configuration

This page deals with configuring the software.

See also: repository configuration

EPrints General Configuration

This section describes all the configuration files in the EPrints system which do not relate to any specific archive.

EPrints Configuration Directory

The general EPrints configuration directory is usually /opt/eprints2/cfg/ and contains the following files:

apache.conf 
This file is generated by generate_apacheconf. See the documentation of generate_apacheconf for more information.
auto-apache.conf 
This file is generated and overwritten by generate_apacheconf. Do not edit it directly. See the documentation of generate_apacheconf for more information.
auto-apache-includes.conf 
This file is generated and overwritten by generate_apacheconf. Do not edit it directly. See the documentation of generate_apacheconf for more information.
languages.xml 
This XML file contains an (exhaustive) list of all ISO language IDs and their names.
system-phrases-languageid.xml 
One of these files per language needed for any archive in this system. These files contain the phrases needed to render the website and email in each language, not counting names of things like metadata fields which vary between archives. It should not be edited by hand, but may be overridden. See the instructions on phrase files in the archive config documentation.
SystemSettings.pm 
Described below.

SystemSettings.pm

This is a Perl module which is created and edited by the eprints installer script when installing or upgrading EPrints. It's found in perl_lib/EPrints/

SystemSettings contains system specific things:

base_path 
The root directory of your eprints install. Normally /opt/eprints2/
executables 
A hash of the path of various external commands such as sendmail and wget.
invocation 
A hash of how eprints is to invoke various external commands. The variables with uppercase names - $(FOO) - are replaced with parameters from eprints, the lowercase names - $(sendmail) - are replaced with the strings in executables.
archive_formats 
An array of IDs of archive formats offered in the upload document page. For each there must be an entry in the archive_extension and invocation. $(DIR) is where eprints wants the contents of the archive and $(ARC) is the archive file.
version_id  
The id of the current eprints version.
version  
The human readable version number.
user  
The UNIX user eprints will run as. Usually "eprints".
group  
The UNIX group eprints will run as. Usually "eprints".
virtualhost (Since v2.1) 
If this is set, it is used for the VirtualHostName in the Apache configuration files. (By default EPrints uses "*").
disable_df (Since v2.1) 
If this is set to 1 then this disables the parts of EPrints which use the df call (disk free). If the "configure" script tested the "df" command and found that it failed then this option will initially be set to 1, otherwise 0.
enable_gdome (Since v2.2) 
If this is set to 1 then it enables the use of the XML::GDOME module, rather than XML::DOM. XML::GDOME is faster and less memory intensive but depends on a number of other libraries and modules which are not worth installing for a trial system.
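The substitution rule described for the invocation entry above (uppercase names come from parameters, lowercase names come from executables) can be sketched like this. It is a Python illustration, not the Perl code EPrints uses; the command template, the paths and the function name expand are all hypothetical.

```python
import re

# Hypothetical settings, loosely modelled on the entries described above.
executables = {"unzip": "/usr/bin/unzip"}
invocation = {"zip": "$(unzip) $(ARC) -d $(DIR)"}


def expand(template, executables, params):
    """Replace each $(name) in the template.

    Uppercase names such as $(ARC) are looked up in `params`;
    lowercase names such as $(unzip) are looked up in `executables`.
    """
    def replace(match):
        name = match.group(1)
        if name.isupper():
            return params[name]
        return executables[name]

    return re.sub(r"\$\((\w+)\)", replace, template)


command = expand(
    invocation["zip"],
    executables,
    {"ARC": "/tmp/upload.zip", "DIR": "/tmp/unpacked"},
)
print(command)
```

Keeping the executable paths separate from the command templates means a tool can be moved, or its arguments tweaked, without touching the other setting.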

Repository Configuration

EPrints Archive Configuration

This section describes all the configuration files in a single archive in the EPrints system.

Primary archive configuration file

Once you have created an EPrints archive the information you entered is placed in an XML file in /opt/eprints2/archives/ with the name archiveid.xml - this file is documented later in this section.

Archive configuration directory

The bulk of the archive configuration is copied from /opt/eprints2/defaultcfg/ into the archive's own configuration directory (usually /opt/eprints2/archives/archiveid/cfg/). This directory will usually contain the following files and directories:

apache.conf 
This file is generated by generate_apacheconf. See the documentation of generate_apacheconf for more information.
apachevhost.conf (added v2.2) 
This file is generated by generate_apacheconf. See the documentation of generate_apacheconf for more information.
ArchiveConfig.pm 
The general configuration items which don't fit anywhere else are in this Perl module. It is described fully later in this section of documentation. This module "requires" the other 5 Perl modules. They are in separate files to make them easier to get to grips with.
ArchiveMetadataFieldsConfig.pm 
This module configures the metadata fields and the default values.
ArchiveOAIConfig.pm 
This module configures how the archive exports itself via the Open Archives protocol.
ArchiveRenderConfig.pm 
This module contains subroutines which handle rendering the data into XHTML (mostly) for display as webpages.
ArchiveTextIndexingConfig.pm 
This module handles turning UTF8 text strings into lists of index words for free text searches.
ArchiveValidateConfig.pm 
This module contains subroutines which check the metadata for problems.
auto-apache.conf 
This file is generated and overwritten by generate_apacheconf. Do not edit it directly. See the documentation of generate_apacheconf for more information.
citations-languageid.xml 
One of these files for each languageid supported by this archive. These XML files describe how to turn metadata for an item into a citation (with markup). They are described fully later in this section of documentation.
entities-languageid.dtd 
One of these files for each languageid supported by this archive. These DTD files are generated automatically just before eprints loads the archive's configuration and should not be edited directly.
metadata-types.xml 
This XML file describes the various types of eprints, users etc. and which metadata fields are required or relevant to each. It is described fully later in this section of documentation.
phrases-languageid.xml 
One of these files for each languageid supported by this archive. These XML files contain all the phrases which are specific to this archive, such as the titles of metadata fields. They are described fully later in this section of documentation.
ruler.xml 
This XML file just contains the horizontal divider used in webpages created by the system. It is described fully later in this section of documentation.
static/ 
This directory contains the data needed to create the static webpages such as the homepage, and about page. It is described fully later in this section of documentation.
subjects 
This file contains the initial subjects for the system. It is described fully in the documentation for import_subjects.
template-languageid.xml 
One of these files for each languageid supported by this archive. These XML/XHTML files describe the outline for webpages for this system. They are described fully later in this section of documentation.

XML Config Files in EPrints

This section contains some general information about the XML archive config files: template, phrases, ruler and citations. metadata-types.xml uses XML but these comments do not apply.

XHTML

These files use HTML elements (and other elements too). XHTML is a fairly new version of HTML which is backwards compatible with HTML 4 but written using XML, not SGML. This means that it is much stricter but less ambiguous and easier to parse and modify. Assuming you know HTML, the main differences are as follows:

All tags must be closed 
All elements must be closed, even ones such as <li>. Tags which do not have a close tag in HTML, like <br> or <img src="foo"> still must be closed eg. <img src="foo"></img> - this can be abbreviated as: <img src="foo" />
All tags and attributes must be lower case 
Self-explanatory.
Strict definition of what tags may appear within others 
Not actually checked by EPrints. It will let any rubbish past as long as it's valid XML. But that's no reason to be naughty.
All attributes must be wrapped in quotes 
In HTML the values of attributes do not have to be wrapped in quotes, but in XML (and therefore XHTML) they do.
All attributes must have a value 
In HTML some attributes do not require a value, for example <hr noshade>. In XHTML it is represented as <hr noshade="noshade" />

So in summary, the HTML:

<img SRC=someurl>
<hr NOSHADE WIDTH=2>
<P>Foo bar</P>

should become in XHTML:

<img src="someurl" />
<hr noshade="noshade" width="2" />
<p>Foo bar</p>

And that's more or less it. See http://www.w3c.org/ for a complete description.
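One quick way to see the difference is to feed both versions to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module and is not part of EPrints; the helper name is_well_formed is made up for the example. Note that it only tests XML well-formedness (closed tags, quoted attributes, attribute values); it does not check the lower-case rule or which tags may nest inside which.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML when wrapped in one element."""
    try:
        ET.fromstring("<div>" + fragment + "</div>")
    except ET.ParseError:
        return False
    return True


html_version = "<img SRC=someurl>\n<hr NOSHADE WIDTH=2>\n<P>Foo bar</P>"
xhtml_version = (
    '<img src="someurl" />\n'
    '<hr noshade="noshade" width="2" />\n'
    "<p>Foo bar</p>"
)

print(is_well_formed(html_version))
print(is_well_formed(xhtml_version))
```

Running a check of this kind on a config file after editing it by hand is a cheap way to catch an unclosed tag before the system tries to load the file.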

Language specific files.

phrases, template and citations have one instance per supported language. This allows the system to generate pages and emails in more than one language. Supporting a new language will require translating all the English in the English config files currently shipped. If you do intend to do this (lots of work!) please get in touch with the eprints admin so that we can avoid duplicated effort.

Extra Entities

The XML files all use a DTD which defines a few extra entities. Entities are items in XML (or HTML) which start with "&" and end with ";" like &amp;. These additional entities come from the entities DTD file created by generate_entities. One DTD is created per language, although currently the only variation is the archive name.

&archivename; 
The name of the archive in the current language.
&adminemail; 
The administrators email address.
&base_url; 
The base URL of the system (without a trailing slash)
&perl_url; 
The base URL of the CGI directory (without a trailing slash)
&frontpage; 
The URL of the system homepage.
&userhome; 
The URL of the user homepage.
&version; 
The current EPrints version.
&ruler; 
The XHTML of the standard divider.
Any XHTML character entity (since EPrints v2.1) 
You may now use any XHTML character entity, eg. &nbsp; &eacute; &euro;.
User configured entities 
You can generate your own entities by modifying the function which generates them in ArchiveConfig.pm

None of these entities are available in the citations file or the ruler file.

Name Spaces and XHTML

These files contain a mixture of custom tags and XHTML. To keep these distinct, the XML files contain a namespace definition in the first element. The practical upshot is that all of EPrints' own tags have the prefix "ep:". The namespace information is actually ignored by the current version of the eprints system.

example of mixed tags (and entities):


<ep:phrase ref="lib/session:contact"><p>Feel free to contact 
<a href="mailto:&adminemail;">&archivename; administration</a> 
with details.</p></ep:phrase>

eprints elements: phrase
xhtml elements: p, a
eprints entities: adminemail, archivename
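To show how the entities and the "ep:" prefix fit together, the sketch below parses a complete, self-contained document with Python's standard xml.etree.ElementTree module. It is only an illustration: the root element name, the namespace URI and the entity values are invented for the example, and the entities are declared inline rather than in a separate DTD file as EPrints does.

```python
import xml.etree.ElementTree as ET

# Assumption: this namespace URI is a placeholder, not the real EPrints one.
EP_NS = "http://example.org/ep"

document = """<!DOCTYPE phrases [
  <!ENTITY adminemail "admin@example.org">
  <!ENTITY archivename "Example Archive">
]>
<ep:phrases xmlns:ep="http://example.org/ep">
  <ep:phrase ref="lib/session:contact"><p>Feel free to contact
  <a href="mailto:&adminemail;">&archivename; administration</a>
  with details.</p></ep:phrase>
</ep:phrases>
"""

root = ET.fromstring(document)

# EPrints' own elements live in the "ep" namespace; the XHTML ones do not.
phrase = root.find("{%s}phrase" % EP_NS)
link = phrase.find(".//a")

# The parser has already replaced the entities with their values.
href = link.get("href")
text = " ".join("".join(phrase.itertext()).split())

print(phrase.get("ref"))
print(href)
print(text)
```

By the time the phrase is used, the entity references have been replaced by their values, so a change to the administrator address only needs to be made where the entity is defined.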

The Primary Archive Configuration File

This XML file appears in the archives/ directory, usually /opt/eprints2/archives/. It describes the most basic details about the archive. It is generated (and modified) by configure_archive and will not normally need to be edited.

EPrints looks in this directory for XML files and attempts to load them all when starting the webserver.

This file should be chmod'd so that it can not be read by random users as it contains the database password.

The top level element is "archive" which has the attribute "id" which is the id of the archive. It should be the same as the filename. If this file is foo.xml then the id should be foo.

<archive> contains a list of XML tags enclosing some text. eg.


 <host>stoatprints.org</host>

The following tags are expected in no special order:

<host> 
The hostname of this archive.
<alias redirect="yes-or-no"> 
This is optional and may be repeated. It has the attribute "redirect" which may be set to yes or no. This controls what virtual hosts are supported and if they should redirect to the main <host>.
<language> 
The ISO id of a language supported by this archive. Repeatable. One of these should also be the defaultlanguage. See below.
<port> 
The port number that the server is running on. Usually 80.
<urlpath> 
The directory from the root of the server name. Usually /
<archiveroot> 
The filesystem path of the rest of the archive configuration.
<configmodule> 
The path to the perl module which does the main configuration (ArchiveConfig.pm)
<dbname> 
The name of the MySQL database. Usually the same as the archive ID.
<dbhost> 
The host on which MySQL is running. Usually localhost.
<dbport> 
An optional MySQL port, if it's not the standard one. Should be empty if we are to use the default.
<dbsock> 
An optional MySQL socket. Should be empty if we are to use the default.
<dbuser> 
The username to use when connecting to MySQL, usually "eprints".
<dbpass> 
The password to use to connect to MySQL.
<defaultlanguage> 
One of the supported languages. This is the default for this archive.
<adminemail> 
The email address of the archive administrator. I strongly suggest that this is an alias rather than a personal email address. If all your webpages contain "bob@footle.edu" and bill takes over from bob you would have to regenerate every page with "bill@footle.edu". Much better to set up an email alias or forward from "archive-support@footle.edu" and point it at bob (for now). Heed these words spoken from grim experience!
<archivename language="langcode"> 
The name of the archive. This has an attribute "language" the value of which is an iso language id. There should be one of these archivename elements per supported language. eg.

   <archivename language="en">White Lemur</archivename>
   <archivename language="fr">La Archive d'Lemur Blanc</archivename>

(apologies to the french, human languages aren't my strong suit)

<securehost> (since v2.2) 
Used for experimental https support.
<securepath> (since v2.2) 
Used for experimental https support.

ArchiveConfig.pm

This module imports the other 5 perl modules. It allows lots of little tweaks to the system, which are all commented in the file.

It includes options to hide various features you may not want and to customise the browse, search and subscription functions.

Also you can customise what each type of user can and can't do, and how they authenticate their passwords.

This configuration file contains perl methods which are called when a session starts and ends, to log things, to generate the entities for the entities file, and to control security on non-public files.

Browse Views

The browse views are generated by the script "generate_views" and what that script does is configured by the "browse_views" item in the config.

It is a reference to a perl array [], each item of which is a hash {}.

The hash has 3 required properties and a number of optional ones.

id (required) 
The ID of this view - the view will be placed in a subdirectory of /view/ of this name. The ID is also used to identify the full name of this view in the phrase file. id=>"foo" would find its title in the phrase "viewname_eprint_foo"
fields (required) 
The list of the names of the fields to browse, separated by a slash "/". This should normally be a single field unless you want to merge the values of two fields. The id part of a field may be specified by appending ".id" to the fieldname.
order (required) 
A list of fields to sort by in order of priority, separated by slashes "/". A minus sign "-" prefixing the fieldname indicates reverse sorting on that field.
allow_null 
Should we make a page for the "unset" condition? A page for items which do not have a year set may be useful. But for other fields this may be meaningless. Set it to 1 for true.
include 
Generate a file for every value, ending in ".include" which contains the XHTML of the citations of records and the number of records, but without wrapping the site standard template around it.
nohtml 
Normally the system generates a page like that described for "include" with a .html suffix and the site template. If nohtml is set to 1 then it won't.
citation 
Normally the citation used is that for the "type" of eprint. If this is set then that citation (from the citations file) will be used for all items. This allows for some clever stuff if you want to make a page which can get sucked into another website.

Normally the system puts a paragraph tag around each citation, but if you use a custom citation this will not happen.

nocount 
Do not include the count of how many items at the top of the page.
nolink 
The system generates an index.html in /view/ with a list of all the browse views available. Setting nolink to 1 will hide this item.
noindex 
Do not generate an index.html file in /view/foo/ listing all the values of the view and linking to their respective pages.
notimestamp (since v2.2) 
Do not add the timestamp at the bottom of the view page.
hideempty (since v2.2) 
Only applicable to subjects. This option will suppress subjects which do not have any records in them. This is useful on "young" archives which look very empty if you have a large subject tree and only a few records, and those clustered in 3 or 4 subjects.

The most common view is to browse by subject:


{ id=>"subject", allow_null=>0, fields=>"subjects", 
   order=>"title/authors", hideempty=>1 }

A more complex view generates a view on author & editor IDs which is not advertised but may be captured by some other software to build staff CV pages.


{ id=>"person", allow_null=>0, fields=>"authors.id/editors.id", 
   nohtml=>1, nolink=>1, noindex=>1, include=>1, 
   order=>"-year/title" }

For my example person id "wh" this will generate a file called /view/person/wh.include (and one for each other value of the author or editor IDs) which can be captured by an external automated system.
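Putting the pieces together, the whole setting might look roughly like the sketch below. It assumes the configuration hash is called $c and that the item is stored as $c->{browse_views}; a stand-in hash is created here only so that the fragment compiles on its own. The "year" view is a hypothetical addition which uses only the options described above.

```perl
use strict;
use warnings;

# Stand-in for the configuration hash. In ArchiveConfig.pm this is
# assumed to exist already, so you would not declare it yourself.
my $c = {};

# A reference to an array [], each item of which is a hash {}.
$c->{browse_views} = [

    # Browse by subject, hiding subjects which have no records.
    { id=>"subject", allow_null=>0, fields=>"subjects",
      order=>"title/authors", hideempty=>1 },

    # Hypothetical browse-by-year view. allow_null=>1 also makes a
    # page for the records which have no year set.
    { id=>"year", allow_null=>1, fields=>"year",
      order=>"title/authors" },
];
```

A view added this way would also need a title in the phrase file, in this case "viewname_eprint_year".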

User Privs

The user permission configuration allows you to set what each type of user can and can't do. The user home page will only show a user the options which they are permitted to use.

New types of user, and which data about themselves they can edit, are set in metadata-types.xml.

Permissions are set by "type" of user. By default there are 3 kinds of user: "user", "editor" and "admin".

Admin can, by default, do everything.

subscription (since EPrints v2.1) 
If included then this kind of user can create subscriptions.
set-password 
Reset their password via the web registration system.
deposit 
Submit items into the archive.
view-status 
View the archive status page.
editor 
User can edit then approve submitted items into the main archive, or delete them, or return them to sender. Also can remove items from the archive back into the edit buffer for corrections, and move records into the deleted table (delete them).
staff-view 
User can perform a "staff search" of user or eprint records and view ALL the metadata.
edit-subject 
User can edit the subject tree via the online interface.
edit-user 
User can edit other users' records.
change-email 
User can change their email address via the web interface. This is safer than allowing them to edit it directly, as it ensures they can only set it to an address at which they receive mail (it mails them a confirmation PIN).
change-user 
This allows the sinister feature which lets you log in as someone else. It still requires a password. This is useful if you want to perform admin tasks as a super user, then log in as a normal user to deposit items.
no_edit_own_record (since v2.2) 
This suppresses the "edit my user record" option. This may be useful if you disable web-registration and import the user records from some other database.
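As an illustration of "permissions are set by type of user", the sketch below maps each of the three default types to a list of the permission names described above. The key names "userauth" and "priv", and the helper type_has_priv, are assumptions made for this example and are not taken from the shipped ArchiveConfig.pm, so check the comments in your own copy for the real layout.

```perl
use strict;
use warnings;

# Stand-in for the configuration hash (assumed to be called $c).
my $c = {};

# Hypothetical layout: user type => { priv => [ permission names ] }.
$c->{userauth} = {
    user   => { priv => [ "subscription", "set-password", "deposit",
                          "change-email" ] },
    editor => { priv => [ "subscription", "set-password", "deposit",
                          "change-email", "view-status", "editor",
                          "staff-view" ] },
    admin  => { priv => [ "subscription", "set-password", "deposit",
                          "change-email", "view-status", "editor",
                          "staff-view", "edit-subject", "edit-user",
                          "change-user" ] },
};

# Illustrative helper (not part of EPrints): returns 1 if the given
# type of user has the given permission, 0 otherwise.
sub type_has_priv
{
    my( $conf, $usertype, $priv ) = @_;

    my $type_conf = $conf->{userauth}->{$usertype};
    return 0 unless defined $type_conf;

    return scalar( grep { $_ eq $priv } @{ $type_conf->{priv} } ) ? 1 : 0;
}
```

With this layout, removing "deposit" from the "user" list would stop ordinary users from submitting items while leaving editors unaffected.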

ArchiveMetadataFieldsConfig.pm

Fields Configuration

Metadata is data about data: the information which we store to describe each record (eprint) in the system. Users also have metadata.

This module is the configuration for the metadata. This is probably the most important part of the system.

See the chapter on metadata for all the configuration options.

Defaults

This section of the file contains subroutines which are called to set default values for Users, Documents and EPrints.

Automatics

These functions let you set automatic fields. This allows you to make fields which are updated automatically each time the item (User/EPrint/Document) is committed to the database.

This allows you to create "compound" fields. Such fields are created by processing the values of other fields rather than being edited directly.

For example, if you wanted to make an automatic int field which contains the number of authors, you could add the following to set_eprint_automatic_fields:


# no authors at all will be undef, not [] so check first
if( $eprint->is_set( "authors" ) )
{
       my $auths = $eprint->get_value( "authors" );
       $eprint->set_value( "authcount" , scalar @{$auths} );
}
else
{
       $eprint->set_value( "authcount" , 0 );
}

ArchiveOAIConfig.pm

This module configures how the archive exports its data via the OAI protocol.

For more information on the how and why of OAI see http://www.openarchives.org/

OAI allows a harvester to request the metadata from your archive and other archives to provide a federated search. The next time the harvester harvests your archive it only has to ask for items which have changed or been added since the last time it asked.

The current version of EPrints supports OAI v2.0. OAI version one is no longer supported.

The base URL for your OAI v2.0 interface will be http://archivepath/perl/oai2

If you want to use the OAI system then you need to fill in the blanks, such as policy and the OAI-id of the archive.

You may create OAI sets in a similar manner to "browse views" in ArchiveConfig.pm.

If you want to change the way that an EPrint is mapped into Dublin Core then edit the make_metadata_oai_dc function, which returns a DOM XML object.

To add a new metadata type you need to add a new mapping function and add entries to the namespaces, schemas and functions items near the top of the file.

ArchiveRenderConfig.pm

This module contains functions which turn data into XHTML for displaying on the web.

If you want to change the way a user info page, or an eprint "abstract" page is rendered then here's the place to do it.

There are also "full" versions of these functions which display all the internal variables and things. These are the views which the editors and admin see.

The XHTML is generated using DOM (Document Object Model), but eprints provides some functions for easily generating XHTML DOM. The only method of DOM you should need to use is appendChild - which adds an element to this element.

EPrints API functions which return XHTML objects.

Note, all text strings should be in UTF-8.

Example:


my $page = $session->make_doc_fragment(); 
my $h1 = $session->make_element( "h1" );
$h1->appendChild( $session->make_text( "Title" ) );
$page->appendChild( $h1 );
$page->appendChild( 
   $session->make_element( 
      "img",
       src=>"/images/cheese.gif",
       width=>128,
       height=>53 ) );

$page now contains:


<h1>Title</h1><img src="/images/cheese.gif" width="128" height="53" />

Many of the EPrints modules are fully documented. For an example try running:


% perldoc /opt/eprints2/perl_lib/EPrints/Archive.pm

The functions most useful for extracting and rendering information are documented here:

$session->make_text( $text )  
Returns a DOM object representing that text.
$session->make_doc_fragment()  
Returns a document fragment. This renders to nothing but is a container to which you can add stuff.
$session->make_element( $name, %opts )  
Makes a simple XHTML element. %opts is an optional series of attributes.

To make <h1 class="foo">...</h1> you would call:

$session->make_element( "h1", class=>"foo" );

$session->render_ruler();  
Returns the default ruler for the archive (from ruler.xml).
$session->render_link( $uri, $target )  
Returns the XHTML element (with URI properly escaped):

<a href="uri"></a>

Which you can appendChild stuff into. If $target is specified then a target attribute is included - to make it pop up a new window.

$item->render_value( $fieldname, $showall )  
$item is either an EPrint, a User or a Document.

$fieldname is the name of the field you want to render. If $showall is 1 then ALL values are rendered in a multilang field.

$item->render_citation( $style )  
Renders the citation of the item using the citation for the item's type from the citation file.

If $style is set then it uses the citation with that id instead.

$item->render_citation_link( $style )  
This renders a citation as above, but links it to the url of the item.
$item->render_description()  
This renders a simple description of the item using the default citation for this dataset eg. for eprint it uses citation type "eprint".
$session->html_phrase( $phraseid, %opts )  
Returns the item from the phrase file. If you don't care about supporting multiple languages then just use make_text instead, it's easier.

It looks first in the archive phrase file for the current language, then in the archive phrase file for English, then in the system phrase file for the current language, and finally in the system phrase file for English. The %opts are a series of DOM elements to place in the "pin" items in the phrase file.

Some other useful functions you may need

$item->get_value( $fieldname, $no_id )  
Returns the value of field $fieldname from the item. An optional second parameter may be set to 1 to return the value without the "id" part, to keep things simple.
$item->is_set( $fieldname )  
Returns true if the field is set on this object, false otherwise.
$eprint->get_all_documents()  
Return an array of the document objects belonging to this eprint.
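The functions listed above can be combined as in the following sketch, which builds a small fragment containing a linked title and, if set, the authors. The sub name render_short_summary, the $url argument and the field names "title" and "authors" are assumptions made for the example; only the $session and $item methods documented above are used.

```perl
use strict;
use warnings;

# Returns a DOM fragment: a heading linking to $url, followed by a
# paragraph of authors when that field has a value.
sub render_short_summary
{
    my( $eprint, $session, $url ) = @_;

    my $page = $session->make_doc_fragment();

    # <h2><a href="...">rendered title</a></h2>
    my $h2 = $session->make_element( "h2" );
    my $link = $session->render_link( $url );
    $link->appendChild( $eprint->render_value( "title" ) );
    $h2->appendChild( $link );
    $page->appendChild( $h2 );

    # Only add the authors paragraph when the field is set.
    if( $eprint->is_set( "authors" ) )
    {
        my $p = $session->make_element( "p", class=>"authors" );
        $p->appendChild( $session->make_text( "By: " ) );
        $p->appendChild( $eprint->render_value( "authors" ) );
        $page->appendChild( $p );
    }

    return $page;
}
```

If you care about supporting more than one language, the literal "By: " would be better fetched with html_phrase.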

ArchiveTextIndexingConfig.pm

This module you probably won't need to change unless you want to modify how eprints does searches for words in strings.

When a record is added to the system, eprints uses this module to turn a string into a list of values which are indexed. By default these are words with 3 letters or more, except some predefined stop words. It also turns latin characters with accents into their plain ASCII (no acute/grave) versions.

It then does the same with the search string and looks for these keys.

Example:


The rain in spain falls mainly on the plains.

Is turned (by default) into the keys:


rain spain fall mainly plain

Thus searching for "rain" or "plain" or "plains" or "MaiNlY" will all match this string.

You may wish to add your own "stop words". eg. If you are running an archive about badgers, a search for the word "badger" will return almost all the records.

At a more complex level you may wish to add handling for non-european character sets (I have no idea how well the default setting will work on these), or do "stemming" - removing "ed", "ing", "ies", "s" etc. from the end of words so that "land" will match "land", "landed", "landing" and "lands". (It currently removes 's').

Another suggestion is using soundex or similar techniques to match words which sound similar.

Changing the indexing on a live system will require you to regenerate the indexes using the reindex script. (If you don't then some of the search results will be wrong).
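The default behaviour described above can be approximated by the following standalone sketch. It is an illustration only, not the code from ArchiveTextIndexingConfig.pm: the sub name sketch_index_keys and the short stop word list are made up for the example, and only a handful of accented characters are handled.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical stop word list; a real one would be much longer.
my %STOP_WORDS = map { $_ => 1 } qw( the and for with from );

# Turn a string into the list of keys which would be indexed.
sub sketch_index_keys
{
    my( $text ) = @_;

    # Matching ignores case, so index everything in lower case.
    $text = lc $text;

    # Map some accented latin characters to their plain versions.
    $text =~ tr/àáâäèéêëìíîïòóôöùúûü/aaaaeeeeiiiioooouuuu/;

    my @keys = ();
    foreach my $word ( split( /[^a-z0-9]+/, $text ) )
    {
        # Very crude stemming: drop a single trailing "s".
        $word =~ s/s$//;

        next if length( $word ) < 3;
        next if $STOP_WORDS{$word};

        push @keys, $word;
    }

    return @keys;
}

print join( " ", sketch_index_keys( "The rain in spain falls mainly on the plains." ) ), "\n";
```

Run on its own this prints "rain spain fall mainly plain", the keys given in the example above. Adding "badger" to the stop word list is all it would take to ignore that word in both the index and the search string.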

ArchiveValidateConfig.pm

This module handles validating data entered by users. Each subroutine is described in more detail in the module itself.

Each subroutine returns a list of DOM elements, each of which describes a single problem. Any problems will prevent the user from continuing with editing until they are corrected.

As with the rendering functions, if you don't care about making this work in more than one language then you can just make the DOM items by calling $session->make_text( "problem explanation" )

The eprint & document validation routines have a flag $for_archive which, if true, indicates that the item is being checked before going into the actual archive. You can use this to force an editor to enter fields which the user may leave blank.

Validation Functions

validate_field 
Called for all fields. Use it to check individual field values. By default it checks that URLs look OK.
validate_eprint_meta 
Check the metadata of an eprint. Use this to test dependencies between fields. eg. if you have a requirement that field "A" OR field "B" must be set.
validate_eprint 
Validate the whole eprint. The last part of the validation of an eprint.
validate_document_meta 
Validate the metadata of the document (as with eprint_meta)
validate_document 
Validate the whole document, files and metadata.
validate_user 
Validate a user record.
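A dependency between fields, such as "field A OR field B must be set", could be checked along the lines of the sketch below. The field names "publication" and "institution" are hypothetical, and the argument order shown is an assumption, so compare it with the subroutine already present in your ArchiveValidateConfig.pm before copying anything.

```perl
use strict;
use warnings;

# Returns a list of DOM elements, one per problem found.
# An empty list means the metadata is acceptable.
sub validate_eprint_meta
{
    my( $eprint, $session, $for_archive ) = @_;

    my @problems = ();

    # At least one of the two (hypothetical) fields must be set.
    if( !$eprint->is_set( "publication" ) && !$eprint->is_set( "institution" ) )
    {
        push @problems, $session->make_text(
            "Please give either a publication or an institution." );
    }

    # A stricter check applied only when the item is about to go
    # into the actual archive, so editors must fill this in even
    # though depositing users may leave it blank.
    if( $for_archive && !$eprint->is_set( "year" ) )
    {
        push @problems, $session->make_text(
            "A year is required before this item can be archived." );
    }

    return @problems;
}
```

For a multi-language archive the make_text calls would be replaced by html_phrase with suitable entries in the archive phrase file.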

citations-languageid.xml

The citations file describes how to render an item (eprint/user/whatever) into a short piece of XHTML. Each citation has a "type". There are 3 kinds of citation:

default citation 
This is a very short description of the item. Usually "the title or failing that, the id". The type id is just the name of the dataset. eg. "eprint"
type citation 
These are richer descriptions which vary between type of eprint, user or document. The type id is dataset_type eg. eprint_preprint.
other citation 
Used by custom browse views. Any name you like.

The citation file contains a list of citation elements:


<ep:citation type="..."> 
Each one may contain text and tags. The text may also include the names of fields in the record being rendered. These names should be between @ symbols. eg. @authors@ or @title@. These will be replaced with a rendered version of the value in that field. (If you need an actual @ symbol for some reason, two @@ with nothing inside will be rendered as a single @).

Note. The @title@ style was introduced in EPrints 2.2. Before that this file used XML entities such as &title; but this caused problems and didn't solve any. Use of entities is still supported, but deprecated.

In addition you may use XHTML elements and the following elements in the eprints namespace. These elements are always removed, but they control whether their contents are kept or not. Conditional elements may be placed inside each other since v2.2.

<ep:linkhere>  
This element is replaced with an XHTML anchor linking to the item. If this citation is being rendered without a link then it is just removed (but not the contents).
<ep:iflink>  
The contents of this element are only preserved if we are rendering this citation as a link. Maybe an icon which you don't want if it's not a link.
<ep:ifnotlink>  
The opposite of iflink.
<ep:ifset ref="fieldname">  
The contents of this element are only preserved if the field "fieldname" has a value.
<ep:ifnotset ref="fieldname">  
The contents of this element are only preserved if the field "fieldname" does not have a value.
<ep:ifmatch name="fieldname(s)" value="searchparam">  
This is the swiss army knife of the world of conditional rendering. It is also a bit complicated, and few people will need to use it. This actually works like a single search element. The attributes are:
name 
This is the name of one or more fields, specified as in the search fields configuration. eg. "title/abstract"
value 
This is a value to search for. Treated like the value entered in a search field.
merge (optional) 
Can be ANY or ALL. Works like the match all? in a search form.
match (optional) 
Can be IN, EQ, or EX. In, Equal or Exact. Exact on subjects means that subject, but not any below it in the hierarchy.

For example:

@year@<ep:ifmatch name="year" value="-1949"> (approx)</ep:ifmatch>

This will render (approx) after years before 1950. Neat eh?

<ep:ifnotmatch name="fieldname(s)" value="searchparam">  
Like ifmatch but only includes the values inside if the search does not match.

metadata-types.xml

This file allows you to configure the types of eprint, user, document and document security level.

When you add a new type you should add its name to the archive phrases file(s). The phraseid is "dataset_typename_typename" eg. "document_typename_pdf". You should also add a new citation to the citations file. Any fields which are not required but appear in the citation should probably be inside an <ep:ifset> so that you don't see "UNSPECIFIED" if they are not, er, specified.

The main element is "metadatatypes". This contains a list of "dataset" elements each of which has a name attribute.

The "type" elements in user and eprint "dataset"s should contain a list of "field" elements. This describes the fields which may be edited for this type and the order that they appear on the form.

You may include system fields in this list, but be careful if you do.

Multi-page metadata (2.3.0+)

You may optionally add <page name="pagename" /> elements to the field list. These break the submission process into smaller stages. The pagename is used to identify the sub-page, for purposes of validation etc. Pages only have an effect on eprint types, not user, document etc.

See the section on paged metadata.

Attributes for "field" element

name (May not be omitted) 
The name of the metadata field.
required 
If set to "yes" then this field may not be left blank. Some system fields are always required no matter how this is set.
staffonly 
This field only appears on the "editor" edit eprint form, not the user one. Or, in the case of the user dataset, the staff edit-user page.

The "security" dataset

This is a handy place to define the security levels. The type with no name is special: it is the "public" security type. All other types will require a valid username and password. Whether that username is acceptable for a given document is decided by the can_user_view_document subroutine in ArchiveConfig.pm.

The "document" dataset

By default eprints requires at least one of ps, pdf, ascii or html to be uploaded before an eprint is valid. You may change this list in ArchiveConfig.pm - any more complicated conditions will have to be checked in the eprint validation subroutine.

phrases-languageid.xml

This file contains a list of XML "phrases". Everything eprints "says" to users is stored in this file and its system-level counterpart. If you want the site to run in more than one language, you need one phrase file per language.

The phrase file is XML and contains a toplevel "phrases" element. This contains the list of phrases.

Each phrase has a "ref" attribute to identify it and contains text and optionally some XHTML tags. It may also contain eprints entities such as &archivename; and also some phrases should contain "pin" elements, described below.

The phrases in the archive phrase file are specific to that archive; the system phrase file contains non-archive specific phrases. The IDs of most of the phrases in the archive phrase file are generated from the IDs of the fields, datasets, types etc.

The archive phrase file contains: names of dataset types, names of metadata fields, help on entering each metadata field, the names of options in "set" fields, the description of different search ordering options, names of browse views, phrases used in the render and validation routines, mail which eprints sends out and phrases which override those in the system file.

pins

Some phrases need some "pin" elements to show eprints where to insert values. Usually pins don't contain any elements but occasionally they do when they represent what to place a link around.
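As a sketch of how pins are filled in from Perl, the example below passes one DOM element per pin to html_phrase. The phrase id "example:greeting", the pin names "name" and "link", the URL and the sub name render_greeting are all made up for illustration; the phrase shown in the comment would have to be added to the archive phrase file.

```perl
use strict;
use warnings;

# Assumes the archive phrase file contains a phrase such as:
#
#   <ep:phrase ref="example:greeting">Hello <ep:pin ref="name" />,
#   please read the <ep:pin ref="link">help pages</ep:pin>.</ep:phrase>
#
# The first pin is empty and is replaced by a value. The second
# contains the text which the link is placed around.
sub render_greeting
{
    my( $session, $username ) = @_;

    return $session->html_phrase(
        "example:greeting",
        name => $session->make_text( $username ),
        link => $session->render_link( "/help/" ) );
}
```

Keeping the wording in the phrase file and only the values in the code means a translator can reorder the sentence without touching any Perl.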

Overriding System Phrases

If you don't like some of the phrases in the main system phrases file you can override them by creating a phrase with the same "ref" in the archive file.

Don't edit the system file. If you upgrade eprints to a newer version it will get overwritten.

Emails

EPrints sends out emails when a user registers/changes their password, when a user changes their email, when a deposited item is rejected/deleted by an editor and when the system is low on resources. These mails can be customised in the phrase file.

Make sure you wrap your text in paragraph <p> tags. EPrints will automatically word wrap these in the email.

<hr /> elements in a mail are turned into a line of dashes.

When eprints sends a mail it will send it as plain ASCII text, unless it contains latin-1 elements, in which case it will be latin-1 encoded. If it contains unicode characters not in the latin-1 charset then it will be utf-8 encoded.
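The rule above amounts to picking the smallest character set that can hold every character in the message. A minimal standalone sketch follows, assuming the text is already a decoded Perl character string; the sub name sketch_mail_charset and the charset labels it returns are illustrative, not the names eprints uses internally.

```perl
use strict;
use warnings;

# Returns the name of the smallest charset able to represent $text.
sub sketch_mail_charset
{
    my( $text ) = @_;

    # Find the highest code point used anywhere in the string.
    my $max = 0;
    foreach my $char ( split( //, $text ) )
    {
        $max = ord( $char ) if ord( $char ) > $max;
    }

    return "us-ascii"   if $max < 128;   # plain ASCII only
    return "iso-8859-1" if $max < 256;   # fits in latin-1
    return "utf-8";                      # anything beyond latin-1
}
```

For example, a message containing only an e-acute would be sent as latin-1, while one containing a character outside latin-1 would push the whole mail to utf-8.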

ruler.xml

This file configures the horizontal divider which eprints uses, which is inserted in place of &ruler;

If you have no great dislike of <hr /> horizontal rulers then you can leave it alone.

You can't use entities like &frontpage; in ruler.

The static/ directory

This directory contains the static pages for the site - the frontpage, the help pages, images, the stylesheet etc.

static/ contains one directory per language, eg. en. Plus a general directory which contains files which don't need translating like images and the stylesheet.

When you run the generate_static command it copies the files for each language, and the general directory, into the static site for that language.

See the generate_static documentation for more details.

subjects

This file is not used by the core eprints system. It is used by import_subjects to set up the initial subjects. For more information see the instructions for import_subjects.

template-languageid.xml

This file is the shell of every page in the system. It is more or less a normal XHTML page but you can use the eprints &foo; entities in it and it should contain "pin" elements like a phrase. The pins it should contain are:

<ep:pin ref="title" />  
This is where to put the title of the page. It can be used more than once - in the title in the page header and somewhere in the body. If placing it in the title in the head of the page you must use the additional attribute textonly="yes" which only works here. It removes images from the title (which can happen if using the "Latex" mode).
<ep:pin ref="head" />  
This goes somewhere in the head of the page. It shows eprints where to insert the "meta" and "link" elements.
<ep:pin ref="pagetop" />  
This goes at the top of the body. It is sometimes used as a "target".
<ep:pin ref="page" />  
Where to place the bulk of the content of the page.


Metadata


Metadata Field Types

The