Fix for better UTF8 support

From EPrints Documentation
Jump to: navigation, search

Fix for a better UTF8 support

Roman Chyla hase setup an Eprint 3.1 server that handles pretty decently all utf8 characters - if you think, that your installation does that too, then please read on.

The original post is available here: http://threader.ecs.soton.ac.uk/lists/eprints_tech/9027.html

An example: [[http://dlib.lib.cas.cz/cgi/search/simple?q=%C4%8Cefel%C3%ADn&_action_search=Search&_order=bytitle&basic_srchtype=ALL&_satisfyall=ALL ]](note the uppercase and lowercace differences and the fact you could not search for uppercased forms in the default EPrints, indexing.pl is not used for authors' names)

Roman believes there is a basic misunderstanding about unicode string object inside EPrints. The operations like uc,lc and also all the regular expression cannot work properly because the object is not marked as utf8 string, you will find example in the indexing.pl below. For more information, please see: [[1]]. And in many places, the utf8() is called that creates the unicode string (and that is something different than utf8 string, there is no utf8 object in Perl). Roman recommend fixing unicode at the input points, not inside of the code at dozens of places


Here is what to do:

1. convert the database tables 2. fix several places inside EPrints (changes are actually very easy, everything in the config files, included for your convenience, it will be very very easy to use it- honestly, in my opinion, no system administrator should ignore this problem)


1.DATABASE CONVERSION


a)dump schema of the database mysqldump --no-data --set-charset -u root -p<password> <db_name> > schema.sql

b)dump the data, it will be actually utf8 encoded, don't be fooled be the charset latin1 bit mysqldump --no-create-info --skip-set-charset -u root -p<yourpassword> --default-character-set=latin1 <db_name> > data.sql

c)open the schema.sql in an editor and:

 1) replace all occurences of CHARSET=latin1 for CHARSET=utf8
 2) also change the dafault NULL charset for columns (see

[[2]])

 3) search for "varchar(255)" and replace "with varchar(255) CHARACTER SET utf8 "

d)set the utf encoding for the data in linux you can do: echo 'SET NAMES utf8;' | cat - data.sql > datautf.sql

e)now load the edited db schema (this will recreate the database, AND DESTROY ALL THE DATA!!! - make sure you have them in datautf.sql) mysql <db_name> -u root -p < schema.sql

f)load the data mysql <db_name> -u root -p < datautf.sql

Now you will have the database in utf8, but certain parts of Eprints will not benefit from it. What you need to do next is to override (or change) code base.


I have created my own module and placed it inside
perl_lib/FixIncredibleSoftware.pm (see below)

then in  session.pl, you need this

$c->{session_init} = sub
{
  my( $session, $offline ) =  AT _;
  $session->get_database->do("SET NAMES utf8");
  require FixIncredibleSoftware;
};

this will affect only your repository, will leave other intact, will
not change EPrints codebase (less headaches for you) and it works for
command-line programs too

The FixIncredibleSoftware.pm looks like this (perl programmers can see
that I am basically dealing with utf8 strings and my strings are
locale specific - the locale on my system is cs_CZ.UTF8, if you want
your EPrints installation to deal with your own alphabet, change your
locale environment)

package FixIncredibleSoftware;
use Sub::Override;
use vars qw /$OVERRIDE/;
use Unicode::String qw(utf8 );
$OVERRIDE = Sub::Override->new(  ); # assumes Foo::foo


$OVERRIDE->replace('EPrints::Index::add', \&add);
$OVERRIDE->replace('EPrints::MetaField::Name::get_index_codes_basic',
\&get_index_codes_basic);

print STDERR "FixIncredibleSoftware loaded\n";

#I just don't like this
$EPrints::Index::FREETEXT_CHAR_MAPPING = { };

sub add
{
	my( $session, $dataset, $objectid, $fieldid, $value ) =  AT _;

	my $field = $dataset->get_field( $fieldid );

	my( $codes, $grepcodes, $ignored ) = $field->get_index_codes(
$session, $value );

	my %done = ();

	my $keyfield = $dataset->get_key_field();

	my $indextable = $dataset->get_sql_index_table_name();
	my $rindextable = $dataset->get_sql_rindex_table_name();


	my $rv = 1;
	my $dbh = $session->get_database->{dbh};

	foreach my $code (  AT {$codes} )
	{
		next if $done{$code};
		$done{$code} = 1;
		my $sql;
		my $fieldword =
EPrints::Database::prep_value($field->get_sql_name().":$code");
		my $sth;
		$sql = sprintf ("SELECT max(pos) FROM %s where fieldword=%s",
$indextable, $dbh->quote($fieldword));
		$sth=$session->get_database->prepare( $sql );
		$rv = $rv && $session->get_database->execute( $sth, $sql );
		return 0 unless $rv;
		my ( $n ) = $sth->fetchrow_array;
		$sth->finish;
		my $insert = 0;
		if( !defined $n )
		{
			$n = 0;
			$insert = 1;
		}
		else
		{
			$sql = sprintf ("SELECT ids FROM $indextable WHERE fieldword=%s AND
pos=%d", $dbh->quote($fieldword), $n);
			$sth=$session->get_database->prepare( $sql );
			$rv = $rv && $session->get_database->execute( $sth, $sql );
			my( $ids ) = $sth->fetchrow_array;
			$sth->finish;
			my(  AT list ) = split( ":",$ids );
			# don't forget the first and last are empty!
			if( (scalar  AT list)-2 < 128 )
			{
				$sql = sprintf ("UPDATE $indextable SET ids=%s WHERE fieldword=%s
AND pos=%d", $dbh->quote("$ids$objectid:"), ↵
$dbh->quote($fieldword),
$n);
				$rv = $rv && $session->get_database->do( $sql );
				return 0 unless $rv;
			}
			else
			{
				++$n;
				$insert = 1;
			}
		}
		if( $insert )
		{
			$sql = sprintf ("INSERT INTO $indextable (fieldword,pos,ids )
VALUES (%s,%d,%s)", $dbh->quote($fieldword), $n,
$dbh->quote(":$objectid:"));
			$rv = $rv && $session->get_database->do( $sql );
			return 0 unless $rv;
		}
		$sql = sprintf ("INSERT INTO $rindextable
(field,word,".$keyfield->get_sql_name()." ) VALUES
('".$field->get_sql_name."',%s,%s)", $code, $objectid);
		$rv = $rv && $sth;
		return 0 unless $rv;

	}

	my $name = $field->get_name;
  my $tbl_name = $dataset->get_sql_grep_table_name;
  #print STDERR $tbl_name, "\n";

  my $sql_insert = $dbh->prepare("INSERT INTO `$tbl_name` VALUES ↵
(?,?,?)");
	foreach my $grepcode (  AT {$grepcodes} )
	{
		$sql_insert->execute( $objectid, $name, $grepcode );
	}
}

sub get_index_codes_basic
{
  my( $self, $session, $value ) =  AT _;
  return( [], [], [] ) unless( EPrints::Utils::is_set( $value ) );

  #this is unicode object (do they know that utf8 is inside?)
  my $f = &EPrints::Index::apply_mapping( $session, $value->{family} );
  my $g = &EPrints::Index::apply_mapping( $session, $value->{given} );

  #print STDERR "utf8() returns unicode::string object, not utf8! " .
Dumper($f);
  #print STDERR "So every operation on the string lc, uc, substr
ignores the locale\n";
  #print STDERR "And all the regular expressions like \\U \\L [a-zA-Z]
are plainly wrong\n";

  # Add a space before all capitals to break
  # up initials. Will screw up names with capital
  # letters in the middle of words. But that's
  # pretty rare.
  my $utf8_string = Encode::decode('UTF-8', $g->utf8, 0);
  my $len_g = length($utf8_string);
  my $new_g = utf8( "" );
  for(my $i = 0; $i<$len_g; ++$i )
  {
    my $s = substr( $utf8_string, $i, 1 );
    if( $s eq "\U$s" )
    {
      $new_g .= ' ';
    }
    $new_g .= $s;
  }

  my $code = '';
  my  AT r = ();
  foreach( EPrints::Index::split_words( $session, $f ) )
  {
    next if( $_ eq "" );
    my $word = lc Encode::decode('UTF-8', $_->utf8, 0);
    push  AT r, lc $word if length($word) > 2;
    $code.= "[$word]";
  }
  $code.= "-";
  foreach( EPrints::Index::split_words( $session, $new_g ) )
  {
    next if( $_ eq "" );
    my $word = lc Encode::decode('UTF-8', $_->utf8, 0);
    push  AT r, $word if length($word) > 2;
    $code.= "[$word]";
  }
  #print STDERR "--get_index_codes_basic returns:\n";
  #print STDERR Dumper( \ AT r, [$code], [] );
  #print STDERR $code, "\n";
  return( \ AT r, [$code], [] );
}

1;


I also needed to change cfg.d/indexing.pl

$c->{extract_words} = sub
{
	my( $session, $text ) =  AT _;

	# Acronym processing only works on uppercase non accented
	# latin letters. If you don't want this processing comment
	# out the next few lines.

	# Normalise acronyms eg.
	# The F.B.I. is like M.I.5.
	# becomes
	# The FBI  is like MI5
	# These are rather expensive to run, so are being commented out
	# by default.
	#my $a;
	#$text =~ s#[A-Z0-9]\.([A-Z0-9]\.)+#$a=$&;$a=~s/\.//g;$a#ge;
	# Remove hyphens from acronyms
	#$text=~ s#[A-Z]-[A-Z](-[A-Z])*#$a=$&;$a=~s/-//g;$a#ge;

	# Process string.
	# First we apply the char_mappings.
	my( $i, $len ),
	my $utext = utf8( "$text" ); # just in case it wasn't already.
	$len = $utext->length;
	my $buffer = utf8( "" );
	for($i = 0; $i<$len; ++$i )
	{
		my $s = $utext->substr( $i, 1 );
		# $s is now char number $i
		if( defined $EPrints::Index::FREETEXT_CHAR_MAPPING->{$s} )
		{
			$s = $EPrints::Index::FREETEXT_CHAR_MAPPING->{$s};
		}
		$buffer.=$s;
	}

	my  AT words =EPrints::Index::split_words( $session, $buffer );

	# Iterate over every word (bits divided by seperator chars)
	# We use hashes rather than arrays at this point to make
	# sure we only get each word once, not once for each occurance.
	my %good = ();
	my %bad = ();
	my $word;
	foreach $word (  AT words )
	{	
		# skip if this is nothing but whitespace;
    $word = Encode::decode('UTF-8', $word->utf8, 0);
		next if ($word =~ /^\s*$/);

		# calculate the length of this word
		my $wordlen = length $word;

		# $ok indicates if we should index this word or not

		# First approximation is if this word is over or equal
		# to the minimum size set in SiteInfo.
		my $ok = $wordlen >= $c->{indexing}->{freetext_min_word_size};
	
		# If this word is at least 2 chars long and all capitals
		# it is assumed to be an acronym and thus should be indexed.
		if( $word =~ m/^[A-Z][A-Z0-9]+$/ )
		{
			$ok=1;
		}

		# Consult list of "never words". Words which should never
		# be indexed.	
		if( $c->{indexing}->{freetext_stop_words}->{lc $word} )
		{
			$ok = 0;
		}
		# Consult list of "always words". Words which should always
		# be indexed.	
		if( $c->{indexing}->{freetext_always_words}->{lc $word} )
		{
			$ok = 1;
		}
	
		# Add this word to the good list or the bad list
		# as appropriate.	
		unless( $ok )
		{
			$bad{$word}++;
			next;
		}

		# Only "bad" words are used in display to the
		# user. Good words can be normalised even further.

		# non-acronyms (ie not all UPPERCASE words) have
		# a trailing 's' removed. Thus in searches the
		# word "chair" will match "chairs" and vice-versa.
		# This isn't perfect "mose" will match "moses" and
		# "nappy" still won't match "nappies" but it's a
		# reasonable attempt.
		$word =~ s/s$//;

		# If any of the characters are lowercase then lower
		# case the entire word so "Mesh" becomes "mesh" but
		# "HTTP" remains "HTTP".
		#if( $word =~ m/[a-z]/ )
		#{
		#	$word = lc $word;
		#}
    $word =~ s/\W//go;
    $word = lc $word;
		$good{$word}++;
	}
	# convert hash keys to arrays and return references
	# to these arrays.
	my(  AT g ) = keys %good;
	my(  AT b ) = keys %bad;
	return( \ AT g , \ AT b );
};