Language Issues in EPrints 2

From EPrints Documentation


Warning: This page contains out-of-date information; the techniques described may no longer work or may have been superseded by better ones.

As EPrints uses UTF-8 encoding internally, there is no problem storing documents or entering metadata in any language. This includes author names, titles, abstracts, etc. Making the software itself speak a different language is a separate issue.

Presently EPrints is available in several languages (see http://files.eprints.org/view/type/translation.html), and projects for others are under way. All text the EPrints software displays is kept in language-dependent parameter files. To make EPrints speak a new language, you only have to translate the content of those files. Certain problems might arise, however.

If your translation uses only ASCII characters (no special characters or accented letters), then any EPrints version will work for you. If you can live with Latin-1 characters, you will probably encounter no problems. There are working translations with Latin-2 characters, which shows that translation into the majority of European languages should pose no serious problem. There is a project to produce a Russian version. Making Arabic or Hebrew versions might be problematic.

If you want to use a local-language version, use GDOME and, if possible, Apache 2 and Perl 5.8.x. Earlier Perl interpreters have problems with UTF-8 encodings.

The following discussion assumes the system is using GDOME.

The native encoding used by EPrints is utf-8. Thus any file using this encoding can be interpreted by EPrints directly. EPrints can understand other encodings as well. Data files with .xml extension are XML files. They always start with the following lines:

<?xml version="1.0" encoding="iso-8859-1" standalone="no" ?>
<!DOCTYPE phrases SYSTEM "entities-en.dtd">

The encoding attribute states which encoding the file uses. If it is missing, UTF-8 is assumed. Here "phrases" is the document type (it can be anything), and the Document Type Definition file entities-en.dtd defines entities of the form "&text;", such as &aacute;, &nbsp; or &archivename;.

Thus you can always choose your favourite encoding (assuming that Perl was configured to understand it); however, you can only use those entities which are defined in the given DTD file. The only exception is numeric character references of the form &#xFE; or &#xA6B1; (hexadecimal Unicode code points), which are always translated to the corresponding UTF-8-encoded character.
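The interplay between the declared encoding and numeric character references can be illustrated outside EPrints. The following is a minimal sketch in Python (not the Perl/GDOME code EPrints itself uses); the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up document stored as ISO-8859-1 bytes. The XML declaration tells
# the parser how to decode the raw bytes; &#xFE; is a numeric character
# reference and does not depend on the file's encoding.
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    "<p>caf\u00e9 &#xFE;</p>"
).encode("iso-8859-1")

# After parsing, the text is ordinary Unicode regardless of the file encoding.
text = ET.fromstring(latin1_doc).text

# Re-encoding as UTF-8 gives the bytes a UTF-8 based system would store.
utf8_bytes = text.encode("utf-8")
print(text, utf8_bytes)
```

Here the reference &#xFE; names the code point U+00FE, which becomes the two bytes C3 BE once encoded as UTF-8.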

Phrase files

Suppose you are translating into Swahili. The two-letter code for this language (found by consulting the languages.xml file in the default cfg directory) is sw. First you should prepare the system-phrases-sw.xml, phrases-sw.xml and citation-sw.xml files, as well as the template and skeleton files used to generate static and dynamic web pages.

The system-phrases-sw.xml should start with the following two lines:

<?xml version="1.0" encoding="XXXXXX" standalone="no" ?>
<!DOCTYPE phrases SYSTEM "entities-sw.dtd">

The encoding must be a valid encoding name. The DTD file entities-sw.dtd is generated automatically (you cannot edit it) and contains all standard definitions from the file xhtml-entities.dtd (which, however, you can edit) plus further EPrints-specific definitions such as &adminemail; and &archivename;. The file must actually use the declared encoding, and can use only those entities which are defined in entities-sw.dtd. As the archive might have different names in different languages, the definition of &archivename; depends on the language; that is why the entities file is language dependent as well.

The best way to proceed is to copy an existing phrases file and translate it. DO NOT FORGET to edit the first two lines. Each piece of text that will appear is defined as

<ep:phrase ref="name_of_the:phrase">The definition for the phrase</ep:phrase>

The value of ref is the name of the phrase; it consists of lowercase English letters, underscores, colons and slashes. In general, it tells which unit uses the phrase and what the phrase should express. In most cases you can guess where the phrase will appear; sometimes you must consult the program source.

The definition can be any well-formed XML text. This means that tags enclosed within < and > signs must come in pairs (every tag must be closed) and be properly nested. Thus if you start a new paragraph by entering <p>, you must also close it with </p>. Tags may have attributes, but the value of an attribute must always be enclosed in quotation marks. To include a picture you can enter

<img src="whatever.gif" width="100" height="12" alt="PIC" />

You must use quotation marks around 100, 12 and PIC even where the HTML standard does not require them. Also, the img tag must be closed; ending it with /> indicates that there will be no separate closing tag.
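Before installing a translated phrase file, it can help to check that each definition is well-formed. The following Python sketch is only an illustration and is not part of EPrints; the helper name is_well_formed is hypothetical, and the check does not resolve entities such as &nbsp; because no DTD is loaded.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment: str) -> bool:
    """Wrap the fragment in a dummy root element and try to parse it."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False

# The example from the text: quoted attributes, self-closing tag.
good = '<img src="whatever.gif" width="100" height="12" alt="PIC" />'
# Acceptable in old HTML, but not in XML: unquoted attribute values.
unquoted = '<img src="whatever.gif" width=100 height=12 alt="PIC" />'
# Acceptable in old HTML, but not in XML: the tag is never closed.
unclosed = '<img src="whatever.gif" width="100" height="12" alt="PIC">'

for sample in (good, unquoted, unclosed):
    print(is_well_formed(sample), sample)
```

Only the first sample passes; the other two fail for exactly the reasons described above.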

You can use all entities (those starting with an & sign) defined in entities-sw.dtd, and numeric character references of the form &#x9A; or &#xB60A;, which give hexadecimal Unicode code points.

There is a special feature which makes EPrints' phrase construction especially flexible: pins. When a phrase is generated, other pieces of text may be available for insertion at particular points. Those points are marked by the tag <ep:pin ref="pinname" />. A phrase might have several pins (or none), and each can be inserted at several places (not only at a single place) and in any order -- however, you cannot insert a pin inside an attribute of a tag (XML syntax forbids it). You can find out which pins are available for a particular phrase by consulting the default phrase files or by looking at the program source.

There is one particular exception to the above general rule, namely links. In this case the pin itself is only the URL, and the text which will link to that URL should be entered between <ep:pin ref="link"> and </ep:pin>, as here:

<ep:phrase ref="lib/userpage:number_of_records">The number of <ep:pin ref="link">records this user has: <ep:pin ref="n"/></ep:pin></ep:phrase>

This phrase has two pins: "link" is a link to the user's record, and "n" is the actual number of records the user submitted to the archive. Pins can be used to maintain correct word order in the particular language. LanguageIssueToDo discusses other pin-related problems.
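To make the pin mechanism concrete, here is a minimal Python sketch of how such a phrase could be filled in. It is not EPrints' actual (Perl) implementation: the namespace URI, the render function and the sample pin values are all assumptions made for illustration, and the output is flattened to a plain string.

```python
import xml.etree.ElementTree as ET

EP = "urn:example:ep"  # placeholder namespace URI, not the real EPrints one

def render(node, pins):
    """Flatten a phrase to a string, filling each ep:pin from `pins`."""
    out = node.text or ""
    for child in node:
        if child.tag == f"{{{EP}}}pin":
            ref = child.get("ref")
            if ref == "link":
                # Link pin: the pin value is the URL, and the content of
                # the element becomes the link text.
                out += f'<a href="{pins[ref]}">{render(child, pins)}</a>'
            else:
                # Ordinary pin: replaced by its value.
                out += str(pins[ref])
        else:
            # Other markup: this sketch keeps only the text inside it.
            out += render(child, pins)
        out += child.tail or ""
    return out

phrase = ET.fromstring(
    f'<ep:phrase xmlns:ep="{EP}" ref="lib/userpage:number_of_records">'
    'The number of <ep:pin ref="link">records this user has: '
    '<ep:pin ref="n"/></ep:pin></ep:phrase>'
)
# Hypothetical pin values supplied by the caller.
result = render(phrase, {"link": "/user/42", "n": 7})
print(result)
```

A translator can move the pins freely within the phrase to suit the word order of the target language; the rendering step only looks at where each pin ends up.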

Citation files

Translating citation files is a bit trickier; what you learn here can be very useful in customizing EPrints. First, start the citation file with the following lines:

<?xml version="1.0" encoding="XXXXXX" standalone="no" ?>
<!DOCTYPE citations SYSTEM "entities-sw.dtd">

This lets you use all "&text;" entities defined for the appropriate language in the entities file.

The general format of a citation is very similar to that of a phrase:

<ep:citation type="record_type">How the record should be rendered
</ep:citation>

Here record_type is an identifier for a rendering style. Records can be rendered using different styles, as defined in the archive configuration files. If no citation style is given, the default one is used.

In the rendering definition you can refer to the values of different fields, which is in a certain sense similar to the pins in phrases. The syntax, however, is different. The field name is enclosed between @ signs; to produce a single literal @ character, you must enter two consecutive ones. The exact syntax is the following: the field name comes immediately after the opening @ sign. It may be followed by .id, indicating that you want not the field's value but the unique identifier attached to it. This may be followed by a sequence of modifiers, separated from the field name and from each other by semicolons. A modifier is either a single word, or a word followed by an = sign and a value. For example,

@pagerange@,  @creators.id@, @date_effective;res=year@,
@title;magicstop@

For the list of modifiers and for their effect, please see the EPrints documentation.
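The field-reference syntax described above can be summarised by a small parser. This is a Python sketch for illustration only, not the code EPrints uses; the function names are hypothetical, and the handling of @@ is simplified (an empty reference is treated as an escaped @).

```python
import re

# Text between two @ signs; an empty match ("@@") is an escaped literal @.
FIELD_REF = re.compile(r"@([^@]*)@")

def parse_field_ref(body):
    """Split the text between two @ signs into (field, wants_id, modifiers)."""
    name, *mods = body.split(";")
    wants_id = name.endswith(".id")
    if wants_id:
        name = name[: -len(".id")]
    modifiers = {}
    for mod in mods:
        key, sep, value = mod.partition("=")
        # A bare word is stored as True, "word=value" as the value string.
        modifiers[key] = value if sep else True
    return name, wants_id, modifiers

def find_field_refs(citation):
    """Return the parsed field references found in a citation string."""
    return [
        parse_field_ref(m.group(1))
        for m in FIELD_REF.finditer(citation)
        if m.group(1) != ""
    ]

example = (
    "@pagerange@,  @creators.id@, @date_effective;res=year@,\n"
    "@title;magicstop@"
)
refs = find_field_refs(example)
for ref in refs:
    print(ref)
```

For the example line from the text, this yields four references: pagerange, creators (with .id), date_effective with the modifier res=year, and title with the bare modifier magicstop.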

Due to the special syntax, field values can appear inside attribute values. In this case, however, you cannot use modifiers or .id (LanguageIssueToDo#citaddrid Why not?). For example, the following is legitimate citation text:

<p align="center"><img src="@coverimage@" alt="[picture]" /></p>

Tags of the form <ep:XXXX> ... </ep:XXXX> are handled by the rendering routine and do not show up in the citation. If the tag is not listed below, then the complete text between the opening and closing tags is left out entirely (it must still be well-formed XML).

<ep:ifset name="FIELDNAME">...</ep:ifset>

The (rendered form of the) text between the opening and closing tags gets into the result only if the named field is not empty.

<ep:ifnotset name="FIELDNAME">...</ep:ifnotset>

The text between the opening and closing tags gets into the result only if the field is empty. This can be used in conjunction with the previous tag to get different forms depending on whether the field is set or not.
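The effect of the two tags can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than EPrints' own renderer: the namespace URI, the render_text function and the sample record are hypothetical, and field references are not expanded.

```python
import xml.etree.ElementTree as ET

EP = "urn:example:ep"  # placeholder namespace URI (assumption)

def render_text(node, record):
    """Flatten a citation fragment, honouring ep:ifset and ep:ifnotset."""
    out = node.text or ""
    for child in node:
        # A field counts as "set" when the record has a non-empty value.
        field_is_set = bool(record.get(child.get("name")))
        if child.tag == f"{{{EP}}}ifset":
            if field_is_set:
                out += render_text(child, record)
        elif child.tag == f"{{{EP}}}ifnotset":
            if not field_is_set:
                out += render_text(child, record)
        else:
            out += render_text(child, record)
        out += child.tail or ""
    return out

citation = ET.fromstring(
    f'<ep:citation xmlns:ep="{EP}" type="demo">'
    '<ep:ifset name="pagerange">pages known</ep:ifset>'
    '<ep:ifnotset name="pagerange">no pages</ep:ifnotset>'
    "</ep:citation>"
)
with_pages = render_text(citation, {"pagerange": "1-10"})
without_pages = render_text(citation, {})
print(with_pages, "/", without_pages)
```

Pairing the two tags on the same field, as here, guarantees that exactly one of the alternatives is rendered.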

<ep:ifmatch name="FIELDNAME" value="SEARCHCONDITION" ... >...</ep:ifmatch>

The "Swiss army knife" of EPrints. An arbitrary search condition can be given, even for several fields separated by slash characters. If the condition holds, the text between the opening and closing tags goes into the result. For usage, see the EPrints documentation.

<ep:ifnotmatch ...> ... </ep:ifnotmatch>

Just the opposite of the previous one; the condition is "reversed".

<ep:iflink>...</ep:iflink>

The same citation style can be used to render a record either as a link pointing to a URL or as plain text. For example, in a browse list records appear as links; in a list meant for printing they do not. The text between the tags is used only if the rendering will become a link.

<ep:ifnotlink>...</ep:ifnotlink>

The opposite of the previous condition.

<ep:linkhere>...</ep:linkhere>

If the citation is used to render a link, then the text between these tags will point to the URL. (It behaves similarly to the <ep:pin ref="link"> ... </ep:pin> tags.) If the citation is not used for a link, then these tags have no effect, and everything between them is rendered as usual. Note: if a citation style is ever used for rendering a link, then it must contain <ep:linkhere> ... </ep:linkhere> with something in between -- otherwise there will be nothing which could point to the URL. A single citation may contain several <ep:linkhere> ... </ep:linkhere> pairs.

Template and skeleton files

These files include the template-sw.xml template file and all skeleton files under the defaultconfig/static/sw directory. Translating these files should be quite straightforward by now. One small remark is in order here. You probably want to retain at least the English version for casual visitors who do not speak your language. Thus you might want to include an "in English" or "Languages" item in the top menu bar. To this end, you can modify the top of template-en.xml to the following:

Clicking on the second item, the user can choose any of the available languages. When translating this page, please do not use the local word for "Language", as people might not recognize which button they have to click to change the language. Rather, include this line:

     <a class="menulink" href="&perl_url;/set_lang?langid=en">In English</a> ||

which is easy to recognize in a foreign language page.

Subject list

Subjects are presented in the chosen language. Presently the translation of the default subject list is in a huge .xml file which can use only a single encoding -- LanguageIssueToDo#subjects to be reviewed

Indexing and searching

The problems discussed here are not closely tied to translation; however, these questions must be addressed somewhere. The issue is searching for names and phrases.

When indexing a textual field, roughly the following happens (Please correct me!)

There is a list of characters FREETEXT_SEPARATOR_CHARS, a map FREETEXT_CHAR_MAPPING, and two word lists FREETEXT_STOP_WORDS and FREETEXT_ALWAYS_WORDS defined in the library file EPrints/Index.pm.

  • First the content of the field is translated by consulting MAPPING. Characters which have an equivalent character sequence in that list are replaced. Characters not in MAPPING are retained. (The intended effect is that accents and diacritics should be stripped.)
  • The resulting string is split into "words". Each word is a sequence of characters NOT in the SEPARATORS list.
  • If the word contains a lowercase letter, then every letter in it is converted to lowercase.
  • If the word is among the ALWAYS words, it is retained. Otherwise it is discarded if it is among the STOP words, or if it is shorter than FREETEXT_MIN_WORD_SIZE.
  • Finally, only one copy of each resulting word is kept.
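The steps above can be sketched as follows. This is a Python approximation of the description, not the Perl code in EPrints/Index.pm; the table contents and the minimum word size are small hypothetical stand-ins for the real FREETEXT_* values.

```python
# Hypothetical stand-ins for the tables in EPrints/Index.pm; the real
# contents are much larger and may differ.
CHAR_MAPPING = {"\u00e9": "e", "\u00c9": "E", "\u00fc": "u", "\u00df": "ss"}
SEPARATOR_CHARS = " \t\n.,;:!?()-"
STOP_WORDS = {"the", "and", "of"}
ALWAYS_WORDS = {"ok"}
MIN_WORD_SIZE = 3

def index_words(text):
    """Return the list of index words for a piece of text."""
    # Step 1: replace mapped characters, keep all others unchanged.
    mapped = "".join(CHAR_MAPPING.get(ch, ch) for ch in text)
    # Step 2: split on separator characters.
    spaced = "".join(" " if ch in SEPARATOR_CHARS else ch for ch in mapped)
    kept = []
    for word in spaced.split():
        # Step 3: lowercase unless the word has no lowercase letter at all.
        if any(ch.islower() for ch in word):
            word = word.lower()
        # Step 4: ALWAYS words survive; otherwise drop stop and short words.
        if word not in ALWAYS_WORDS:
            if word in STOP_WORDS or len(word) < MIN_WORD_SIZE:
                continue
        # Step 5: keep only one copy of each word.
        if word not in kept:
            kept.append(word)
    return kept

sample = "The Caf\u00e9 and the caf\u00e9 of DNA: ok, Stra\u00dfe"
words = index_words(sample)
print(words)
```

With these stand-in tables, the accented and unaccented spellings of a word reduce to the same index entry, while an all-uppercase word such as DNA keeps its case.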

The exact algorithm can be modified in the archive-dependent ArchiveTextIndexingConfig.pm file. However, this algorithm does not apply to names -- names are treated differently, and their indexing behaviour cannot be configured.

When searching, the text entered into the search field goes through this procedure as well. Words thrown away are listed as bad words which cannot be indexed; retained words are searched in the database for hits.

This method ensures that words are found whether they are entered with or without accents.