Difference between revisions of "Unicode"

From EPrints Documentation
Jump to: navigation, search
(New page: To support non-ASCII characters (a to z) EPrints uses [http://www.unicode.org/ Unicode], which is a database of characters (glyphs) used by languages of the world. To store Unicode strings...)
 
(Added link to planned 3.4.4 development)
 
(3 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 +
[[Category:Languages]]
 
To support non-ASCII characters (a to z) EPrints uses [http://www.unicode.org/ Unicode], which is a database of characters (glyphs) used by languages of the world. To store Unicode strings EPrints uses [http://search.cpan.org/~gaas/Unicode-String-2.09/ Unicode::String]. Unicode::String stores strings internally in UCS-2 (2 bytes per character) but converts that to utf-8 when stringified (variable-length bytes per character).
 
To support non-ASCII characters (a to z) EPrints uses [http://www.unicode.org/ Unicode], which is a database of characters (glyphs) used by languages of the world. To store Unicode strings EPrints uses [http://search.cpan.org/~gaas/Unicode-String-2.09/ Unicode::String]. Unicode::String stores strings internally in UCS-2 (2 bytes per character) but converts that to utf-8 when stringified (variable-length bytes per character).
  
Line 17: Line 18:
 
         Table: version
 
         Table: version
 
   Create Table: CREATE TABLE `version` (
 
   Create Table: CREATE TABLE `version` (
     `version` varchar(255) character set latin1 default NULL
+
     `version` varchar(255) default NULL
   ) ENGINE=MyISAM DEFAULT CHARSET=utf8
+
   ) ENGINE=MyISAM DEFAULT CHARSET=latin1
 
   1 row in set (0.00 sec)
 
   1 row in set (0.00 sec)
  
 +
In this case the database is set to latin1 (aka iso-8859-1).
  
 
=== What's actually stored ===
 
=== What's actually stored ===
  
When EPrints stores utf-8 data in MySQL it is simply stored as bytes. For characters below code point 128 this doesn't make any difference (because the Unicode code points as the same as the iso-8859-1 equivalent). For code points above 128 multiple bytes will be stored in MySQL which if viewed from a MySQL client will look like this (in this case é):
+
When EPrints stores utf-8 data in MySQL it is simply stored as bytes. For characters below code point 128 this doesn't make any difference (because the Unicode code points are the same as the iso-8859-1 equivalent). For code points above 128 multiple bytes will be stored in MySQL which, if viewed from a MySQL client, will look like this (in this case é):
  
 
  é
 
  é
Line 32: Line 34:
 
MySQL 5.0 and above support two Unicode encodings: utf-8 and ucs-2.
 
MySQL 5.0 and above support two Unicode encodings: utf-8 and ucs-2.
  
If MySQL is set to use utf-8 it will re-encode the utf-8 data sent from EPrints. For the é example above the data, when viewed in MySQL, will still look like:
+
If MySQL is set to use utf-8 it will re-encode the utf-8 data sent from EPrints. For the "é" example above the data, when viewed in MySQL, will still look like:
  
 
  é
 
  é
  
That's because what MySQL has actually stored is:
+
But what MySQL has actually stored is:
  
 
  é
 
  é
 
And when you view the data in MySQL it is decoding the set of 4 bytes into 2 characters for your MySQL tool.
 
  
 
== MySQL and Collation ==
 
== MySQL and Collation ==
  
MySQL (in common with other databases) will perform some language-specific collation. This concerns how strings should be sorted and matched. For instance MySQL will match "âge" to "age" if the data is correctly stored.
+
MySQL (in common with other databases) will perform some language-specific collation. Collation determines how strings should be sorted and matched. For instance MySQL will match "âge" to "age" if the data is correctly stored.
  
 
Because MySQL does not realise the data it is passed by EPrints is utf-8 it will not store the correct data and hence can not correctly perform collation for non-ASCII characters.
 
Because MySQL does not realise the data it is passed by EPrints is utf-8 it will not store the correct data and hence can not correctly perform collation for non-ASCII characters.
Line 50: Line 50:
 
== Making EPrints and MySQL talk the same language ==
 
== Making EPrints and MySQL talk the same language ==
  
To enable MySQL to understand the data EPrints is giving it (utf-8) you need to call the following statement at the start of an EPrints session:
+
From EPrints 3.3 onwards MySQL connections are configured to understand the data stored as UTF-8.  This is acheived by the following MySQL command being used each time a MySQL connectionn is established.
  
 
  SET NAMES utf8;
 
  SET NAMES utf8;
  
You can do this by modifying archives/[repoid]/cfg/cfg.d/session.pl as follows:
+
If you have a very old MySQL database you may need to migrate individual columns from Latin-1 to UTF-8:
 +
 
 +
ALTER [table] MODIFY [column] VARBINARY(255);
 +
ALTER [table] MODIFY [column] VARCHAR(255) CHARACTER SET utf8;
 +
 
 +
ALTER [table] MODIFY [column] BLOB;
 +
ALTER [table] MODIFY [column] TEXT CHARACTER SET utf8;
 +
 
 +
== Managing 32-Bit Unicode Characters ==
 +
 
 +
In recent years there has been a massive expansion to the number of Unicode characters, to the extent that there is not enough bits to represent each character in UTF-8.  Therefore there was an augmentation called UTF-8 MB4, which allows this extra Unicode characters, such as 𝒪 (mathematical script capital O).  If you need support for such characters there are two things you need to do:
 +
 
 +
1. Modify '''perl_lib/EPrints/Database/mysql.pm''' and chnage the line:
 +
$self->do("SET NAMES 'utf8'");
 +
to:
 +
$self->do("SET NAMES 'utf8mb4'");
  
  $c->{session_init} = sub
+
2. Modify the column in the table. You cannot do this for VARCHAR columns unless they are 191 characters or fewer.  Due to the maximum number of bytes allowed for a VARCHAR column.
  {
+
ALTER [table] MODIFY [column] TEXT CHARACTER SET utf8mb4;
    my( $session, $offline ) = @_;
+
  ALTER [table] MODIFY [column] LONGTEXT CHARACTER SET utf8mb4;
 +
ALTER [table] MODIFY [column] VARCHAR(191) CHARACTER SET utf8mb4;
  
    $session->get_database->do("SET NAMES utf8");
+
There are plans to implement these changes in [[EPrints 3.4.4]], so that newly created repository will be able to handle 32-bit Unicode characters. If this is implemented upgrading to 3.4.4 will likely still require manually updating existing fields.
  };
 
  
Warning! You can't do this to an existing database because it's storing utf-8 in an iso-8859-1 encoded columns. To migrate an existing database you need to do this for every column:
+
For more information see https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-difference-between-utf8-and-utf8mb4/

Latest revision as of 23:09, 26 March 2021

To support non-ASCII characters (a to z) EPrints uses Unicode, which is a database of characters (glyphs) used by languages of the world. To store Unicode strings EPrints uses Unicode::String. Unicode::String stores strings internally in UCS-2 (2 bytes per character) but converts that to utf-8 when stringified (variable-length bytes per character).

Perl's native encoding for Unicode is utf-8.

EPrints 3.2 replaces Unicode::String with Perl's native Unicode support.

Unicode and MySQL

Before 5.0 MySQL only supported iso-8859-1 (aka latin-1) character encodings.

Determining the current character set

To determine the character set MySQL is currently using execute this from the MySQL client:

mysql> show create table version;
 *************************** 1. row ***************************
        Table: version
 Create Table: CREATE TABLE `version` (
   `version` varchar(255) default NULL
 ) ENGINE=MyISAM DEFAULT CHARSET=latin1
 1 row in set (0.00 sec)

In this case the database is set to latin1 (aka iso-8859-1).

What's actually stored

When EPrints stores utf-8 data in MySQL it is simply stored as bytes. For characters below code point 128 this doesn't make any difference (because the Unicode code points are the same as the iso-8859-1 equivalent). For code points above 128 multiple bytes will be stored in MySQL which, if viewed from a MySQL client, will look like this (in this case é):

é

When the data is read by EPrints it will use the raw bits as utf-8. So long as MySQL doesn't change the data coming in and out everything works correctly.

MySQL 5.0 and above support two Unicode encodings: utf-8 and ucs-2.

If MySQL is set to use utf-8 it will re-encode the utf-8 data sent from EPrints. For the "é" example above the data, when viewed in MySQL, will still look like:

é

But what MySQL has actually stored is:

é

MySQL and Collation

MySQL (in common with other databases) will perform some language-specific collation. Collation determines how strings should be sorted and matched. For instance MySQL will match "âge" to "age" if the data is correctly stored.

Because MySQL does not realise the data it is passed by EPrints is utf-8 it will not store the correct data and hence can not correctly perform collation for non-ASCII characters.

Making EPrints and MySQL talk the same language

From EPrints 3.3 onwards MySQL connections are configured to understand the data stored as UTF-8. This is acheived by the following MySQL command being used each time a MySQL connectionn is established.

SET NAMES utf8;

If you have a very old MySQL database you may need to migrate individual columns from Latin-1 to UTF-8:

ALTER [table] MODIFY [column] VARBINARY(255);
ALTER [table] MODIFY [column] VARCHAR(255) CHARACTER SET utf8;
ALTER [table] MODIFY [column] BLOB;
ALTER [table] MODIFY [column] TEXT CHARACTER SET utf8;

Managing 32-Bit Unicode Characters

In recent years there has been a massive expansion to the number of Unicode characters, to the extent that there is not enough bits to represent each character in UTF-8. Therefore there was an augmentation called UTF-8 MB4, which allows this extra Unicode characters, such as 𝒪 (mathematical script capital O). If you need support for such characters there are two things you need to do:

1. Modify perl_lib/EPrints/Database/mysql.pm and chnage the line:

$self->do("SET NAMES 'utf8'");

to:

$self->do("SET NAMES 'utf8mb4'");

2. Modify the column in the table. You cannot do this for VARCHAR columns unless they are 191 characters or fewer. Due to the maximum number of bytes allowed for a VARCHAR column.

ALTER [table] MODIFY [column] TEXT CHARACTER SET utf8mb4;
ALTER [table] MODIFY [column] LONGTEXT CHARACTER SET utf8mb4;
ALTER [table] MODIFY [column] VARCHAR(191) CHARACTER SET utf8mb4;

There are plans to implement these changes in EPrints 3.4.4, so that newly created repository will be able to handle 32-bit Unicode characters. If this is implemented upgrading to 3.4.4 will likely still require manually updating existing fields.

For more information see https://www.eversql.com/mysql-utf8-vs-utf8mb4-whats-the-difference-between-utf8-and-utf8mb4/