Fix for better UTF8 support

From EPrints Documentation
Revision as of 09:35, 11 July 2008 by Tajoli (Talk | contribs)

Roman Chyla has set up an EPrints 3.1 server that handles all UTF-8 characters reasonably well. If you think that your installation does that too, then please read on.

An example: http://dlib.lib.cas.cz/cgi/search/simple?q=%C4%8Cefel%C3%ADn&_action_search=Search&_order=bytitle&basic_srchtype=ALL&_satisfyall=ALL (note the uppercase and lowercase differences, and the fact that you cannot search for uppercase forms in the default EPrints; indexing.pl is not used for authors' names)

Roman believes there is a basic misunderstanding about Unicode string objects inside EPrints. Operations like uc and lc, and also all regular expressions, cannot work properly because the string is not marked as a UTF-8 string; you will find an example in indexing.pl below. For more information, please see: [[1]]. Also, in many places utf8() is called, which creates a Unicode string (that is something different from a UTF-8 string; there is no utf8 object in Perl). Roman recommends fixing Unicode at the input points, not inside the code in dozens of places.


Here is what to do:

1. Convert the database tables.
2. Fix several places inside EPrints. The changes are actually very easy: everything is in the config files, which are included for your convenience, so they are simple to apply. In my opinion, no system administrator should ignore this problem.


1. DATABASE CONVERSION


a) Dump the schema of the database:
 mysqldump --no-data --set-charset -u root -p<password> <db_name> > schema.sql

b) Dump the data. It will actually be UTF-8 encoded; don't be fooled by the latin1 charset option:
 mysqldump --no-create-info --skip-set-charset -u root -p<yourpassword> --default-character-set=latin1 <db_name> > data.sql
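Because the later steps rely on data.sql already containing UTF-8 bytes, it may be worth checking that assumption before going further. A minimal sketch, assuming iconv is available; the function name is_valid_utf8 is hypothetical:

```shell
# Hypothetical helper: succeeds only if the file $1 decodes cleanly as UTF-8.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example use:
#   is_valid_utf8 data.sql && echo "data.sql looks like UTF-8"
```

This only shows that the bytes form valid UTF-8. A pure ASCII dump, or text that was encoded twice, would pass as well, so it is still sensible to look at a few accented records by eye.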

c) Open schema.sql in an editor and:

 1) replace all occurrences of CHARSET=latin1 with CHARSET=utf8
 2) also change the default NULL charset for columns (see [[2]])
 3) search for "varchar(255)" and replace it with "varchar(255) CHARACTER SET utf8"
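If you prefer not to edit the schema by hand, replacements 1) and 3) can be sketched with sed. This is only an illustration: the function name convert_schema and the output file name are hypothetical, and replacement 2) is not covered, so that one still has to be done manually.

```shell
# Hypothetical helper: convert_schema INPUT OUTPUT
# Applies replacements 1) and 3) and writes the result to a new file,
# leaving the original dump untouched.
# Run it only once: a second pass over its own output would append
# "CHARACTER SET utf8" to every varchar(255) again.
convert_schema() {
    sed -e 's/CHARSET=latin1/CHARSET=utf8/g' \
        -e 's/varchar(255)/varchar(255) CHARACTER SET utf8/g' \
        "$1" > "$2"
}

# Example use:
#   convert_schema schema.sql schema_utf8.sql
#   grep -n latin1 schema_utf8.sql    # list any latin1 mentions that remain
```

Writing to a separate file keeps the original schema.sql available for comparison. If you use the converted file, load it in place of schema.sql in the step that recreates the database.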

d) Set the UTF-8 encoding for the data. On Linux you can do:
 echo 'SET NAMES utf8;' | cat - data.sql > datautf.sql

e) Now load the edited database schema. This will recreate the database AND DESTROY ALL THE DATA, so make sure you have it in datautf.sql:
 mysql <db_name> -u root -p < schema.sql

f) Load the data:
 mysql <db_name> -u root -p < datautf.sql