Difference between revisions of "Indexing.pl"

From EPrints Documentation

Revision as of 12:48, 23 January 2022


This file contains configuration for indexing data objects.

In particular, it configures whether indexing is enabled and, if so, the following settings:

  • $c->{indexing}->{freetext_min_word_size} - The minimum length a word in a free-text field must have to be indexed. The default is 3.
  • $c->{indexing}->{freetext_stop_words} - Words that should not be indexed in free-text fields because they are too common (e.g. and, are, the, you).
  • $c->{indexing}->{freetext_seperator_chars} - Characters that separate words in a free-text field (e.g. colon :, equals =, hyphen -, full stop ., space). N.B. The spelling seperator is a typo in the codebase that cannot now be fixed for legacy reasons.
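
As a rough illustration, the three settings above might be written along the following lines. This is a sketch rather than a copy of the shipped file: it assumes that the stop words and separator characters are stored as hash references keyed by word or character, and the words and characters listed are just the examples mentioned above, not the full defaults.

```perl
use strict;
use warnings;

# In a real cfg.d file EPrints supplies $c; this line exists only so the
# fragment can be run on its own.
my $c = {};

# Words shorter than this are not indexed (3 is the documented default).
$c->{indexing}->{freetext_min_word_size} = 3;

# Assumed layout: a hash reference keyed by the word to skip.
$c->{indexing}->{freetext_stop_words} = {
    "and" => 1,
    "are" => 1,
    "the" => 1,
    "you" => 1,
};

# Assumed layout: a hash reference keyed by the separating character.
# Note the legacy spelling "seperator" in the key name.
$c->{indexing}->{freetext_seperator_chars} = {
    ":" => 1,
    "=" => 1,
    "-" => 1,
    "." => 1,
    " " => 1,
};
```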

The file also contains the extract_words function, which defines how individual words are extracted from free text. This may vary across different types of repository, and some repositories have edge cases they need to handle, so it has been purposely designed as a user-defined function to accommodate bespoke requirements.
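
To give a feel for what such a function might do, here is a minimal standalone sketch. It is not the extract_words implementation shipped with EPrints: the name extract_words_sketch, its arguments, its return value and the stand-in configuration are all hypothetical, and the settings are assumed to be hash references as in a typical Perl configuration. It splits text at separator characters, then discards stop words and words below the minimum size.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the repository configuration.
my $c = {
    indexing => {
        freetext_min_word_size   => 3,
        freetext_stop_words      => { "and" => 1, "are" => 1, "the" => 1, "you" => 1 },
        freetext_seperator_chars => { ":" => 1, "=" => 1, "-" => 1, "." => 1, " " => 1 },
    },
};

# Illustrative only: returns a reference to the list of indexable words.
sub extract_words_sketch
{
    my( $conf, $text ) = @_;
    my $idx = $conf->{indexing};

    # Split the lower-cased text into candidate words at separator characters.
    my @candidates = ( "" );
    foreach my $char ( split( //, lc( $text ) ) )
    {
        if( $idx->{freetext_seperator_chars}->{$char} )
        {
            push @candidates, "";
        }
        else
        {
            $candidates[-1] .= $char;
        }
    }

    # Keep words that are long enough and are not stop words.
    my @words = grep {
        length( $_ ) >= $idx->{freetext_min_word_size}
            && !$idx->{freetext_stop_words}->{$_}
    } @candidates;

    return \@words;
}

my $words = extract_words_sketch( $c, "The key=value pairs: you and me." );
print join( " ", @$words ), "\n";
```

A repository with unusual requirements, such as identifiers that contain hyphens, would change the splitting or filtering rules in a function like this, which is the kind of edge case a user-defined function allows for.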