Files/FileNamesUTF8

From EPrints Documentation

Hack to allow file names with UTF-8 characters (for browsing)

It all started with a problem concerning browsing and UTF-8. I built a browse index on journal titles (for example 'Library Tech News', 'Library tech', etc.). The field in which the depositor enters the journal title is named 'pubblication'.


The browse index consists of static web pages: a menu page lists all journal titles and links to one page per journal. The web page for a journal is named <journal title>.html. For example: http://eprints.rclis.org/view/journtitle/Biology_Education.html The corresponding file on the file system is named Biology_Education.html


But what happens when a journal title contains non-Latin characters? RFC 1738 makes clear that a URL may contain only US-ASCII characters, so EPrints translates every non-US-ASCII character into a numeric form: its hexadecimal Unicode code point prefixed with '=='. For example: http://eprints.rclis.org/view/journtitle/==6D25==56FE==5B66==520A.html (a Chinese journal). This translation is costly, because it uses 6 characters for every non-US-ASCII character. On Linux with the ext3 file system, a file name can be at most 255 bytes long, so I can index only journal titles of at most 42 such characters, which is too few.
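The arithmetic behind that limit can be checked with a short sketch. It is written in Python purely for illustration and is not the EPrints code: the function name `escape_like_eprints` is hypothetical, and it only imitates the `==XXXX` pattern visible in the URL above (the real escape_filename routine may treat other characters differently).

```python
MAX_NAME_BYTES = 255  # ext3 limit on a single file name, in bytes


def escape_like_eprints(title):
    """Imitate the '==XXXX' escaping seen in the example URL (an assumption)."""
    out = []
    for ch in title:
        if ch == " ":
            out.append("_")                  # as in Biology_Education.html
        elif ord(ch) < 128:
            out.append(ch)                   # US-ASCII is kept unchanged here
        else:
            out.append("==%04X" % ord(ch))   # hex Unicode code point
    return "".join(out)


title = "\u6d25\u56fe\u5b66\u520a"  # the four characters from the example URL
escaped = escape_like_eprints(title)

print(escaped)                                          # ==6D25==56FE==5B66==520A
print(len(escaped) // len(title))                       # 6 bytes per character
print(MAX_NAME_BYTES // 6)                              # 42 escaped characters fit
print(MAX_NAME_BYTES // len("\u6d25".encode("utf-8")))  # 85 fit as raw UTF-8
```

The `.html` suffix also counts toward the 255 bytes, so the practical limits are slightly lower than these figures. Storing the same characters as raw UTF-8 (3 bytes each for these Chinese characters) roughly doubles the usable title length.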


The problem shows up at line 484 of generate_views, where the page is written to its file:

       print FILE EPrints::XML::to_string( $page, undef, 1 );

So the solution (the idea came from Chris) is to hack the routine that builds the file names. That routine is escape_filename in EPrints::Utils (Utils.pm).

You can download the routine as modified by me from http://eprints.rclis.org/fixsoft/Utils.pm.gz. Please be careful: it is a major change. You must rebuild all files with generate_abstracts, generate_views and generate_static (after stopping and restarting Apache). You must also have a file system that supports UTF-8 in file names (such as ext3).

The result (a journal in Greek): http://eprints.rclis.org/view/journtitle/%CE%A4%CE%B5%CE%BA%CE%BC%CE%AE%CF%81%CE%B9%CE%BF%CE%BD.html The link is written with percent-encoded hex values, but on the file system the name is stored in Greek characters.
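The relationship between the link and the name on disk can be illustrated with Python's standard `urllib.parse` module. This is a sketch for illustration only: EPrints itself is written in Perl, and the variable names are hypothetical. Percent-decoding the last path segment as UTF-8 gives the name stored on the file system, and re-encoding that name gives back the link form.

```python
from urllib.parse import quote, unquote

# Last path segment of the example link above
link_name = "%CE%A4%CE%B5%CE%BA%CE%BC%CE%AE%CF%81%CE%B9%CE%BF%CE%BD.html"

# Decode the percent-escapes as UTF-8 bytes to get the on-disk name
disk_name = unquote(link_name, encoding="utf-8")
print(disk_name)

# Bytes the name occupies on disk: 2 per Greek letter, plus ".html"
disk_bytes = len(disk_name.encode("utf-8"))
print(disk_bytes)

# Re-encoding the on-disk name reproduces the link form
round_trip = quote(disk_name)
print(round_trip == link_name)
```

The browser and the web server handle this conversion transparently, so only the file system needs to accept UTF-8 names.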