Files/IndexNoLatin

From EPrints Documentation

Better indexing for non-basic-Latin characters

This hack is about EPrints::Index.pm

There are severe problems with indexing foreign-language texts. Unfortunately the tables FREETEXT_CHAR_MAPPING, FREETEXT_STOP_WORD, FREETEXT_ALWAYS_WORDS and FREETEXT_SEPARATOR_CHARS went into the system library. They are language- and archive-dependent, so they should be handled more flexibly. http://oziris.ceu.hu/2.3.3/patchIndex.txt is a patch that adds the four Hungarian accented letters that are not in Latin-1 (ő, Ő, ű, Ű) to the list. It would be better to include all accented UTF-8 characters.
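As a rough illustration of what "more flexible handling" could look like, here is a minimal sketch in Python (not the Perl of EPrints::Index.pm) that layers a per-archive mapping over a built-in default. The names DEFAULT_CHAR_MAPPING, HUNGARIAN_CHAR_MAPPING, build_mapping and fold are hypothetical and are not part of EPrints; the default entries are just two sample Latin-1 letters.

```python
# Hypothetical built-in defaults (sample Latin-1 entries only, not the EPrints table).
DEFAULT_CHAR_MAPPING = {
    "\u00e9": "e",  # é
    "\u00f6": "o",  # ö
}

# Hypothetical per-archive addition: the four Hungarian letters outside Latin-1.
HUNGARIAN_CHAR_MAPPING = {
    "\u0151": "o",  # ő
    "\u0150": "O",  # Ő
    "\u0171": "u",  # ű
    "\u0170": "U",  # Ű
}


def build_mapping(*overrides):
    """Copy the defaults, then apply each archive-specific mapping in order."""
    mapping = dict(DEFAULT_CHAR_MAPPING)
    for extra in overrides:
        mapping.update(extra)
    return mapping


def fold(text, mapping):
    """Replace every mapped character; leave all other characters untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)


archive_mapping = build_mapping(HUNGARIAN_CHAR_MAPPING)
folded = fold("erd\u0151 t\u0171z", archive_mapping)  # "erdő tűz"
```

With this layout an archive that does not need the Hungarian letters simply calls build_mapping() with no arguments, and the shared defaults stay untouched.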

The Hungarian characters have been AppliedToCVS after version 2.3.4. Has anyone got a nice big list of how to dumb down ALL of Unicode? What the hell do you do with kanji?

http://oziris.ceu.hu/2.3.3/generate_utf8 is a small Perl program that consults Perl's built-in UTF-8 definition file and extracts all characters that are variants of Latin letters. Change the first line to point to your Perl interpreter; it works with Perl 5.8.x. Its output is itself a small Perl program, which generates a hash with entries of the form unicode => [ lower_case_unicode, lowercase_latin_equivalent ].
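The same idea can be sketched in Python using the standard unicodedata module. This is not the generate_utf8 script; it is an assumed approximation that derives the Latin equivalent from the Unicode compatibility decomposition (NFKD) and drops combining marks. The function names latin_equivalent and build_table, and the default code point range U+00C0 to U+024F, are choices made for this example only.

```python
import unicodedata


def latin_equivalent(ch):
    """Return the lowercase ASCII base letters of ch, or None if it has none."""
    decomposed = unicodedata.normalize("NFKD", ch)
    # Drop combining marks (accents) and keep whatever base characters remain.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    if base and base.isascii() and base.isalpha():
        return base.lower()
    return None


def build_table(start=0x00C0, stop=0x0250):
    """Map code point -> (lowercase form, lowercase Latin equivalent)."""
    table = {}
    for cp in range(start, stop):
        ch = chr(cp)
        equiv = latin_equivalent(ch)
        if equiv is not None:
            table[cp] = (ch.lower(), equiv)
    return table


table = build_table()
```

Characters without a decomposition to ASCII letters, such as ß or kanji, get no entry under this approach, so they would still need explicit rules or would have to be indexed unchanged.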