API:EPrints/Plugin/Search/Xapian
Latest revision as of 07:03, 8 September 2015
NAME
EPrints::Plugin::Search::Xapian
DESCRIPTION
Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.
The Xapian plugin currently only supports simple searches.
Xapian simple searches are parsed by the Xapian query parser which supports prefixes for search terms:
title:(eagle buzzard) abstract:"london wetlands"
The field prefixes are taken from the search configuration and constrain the following term (or bracketed terms) to that field only. If no prefix is given, the entire Xapian index is used, i.e. the search matches any indexed term, not just those from the search configuration fields. For example, the following simple search configuration:
search_fields => [
    {
        id => "q",
        meta_fields => [ "documents", "title", "abstract", "creators_name", "date" ]
    },
],
allows the user to specify "documents", "title", "abstract", "creators_name" or "date" as a prefix to a search term. Omitting the prefix will match any indexed field, e.g. "publisher".
Terms can be negated by prefixing the term with '-':
eagle -buzzard
Phrases can be specified by using quotes. For example, "Southampton University" won't match University of Southampton.
Terms are stemmed by default ('bubbles' becomes 'bubble') except if you use the term in a phrase.
Partial matches are supported by using '*':
ameri* - americans, americas, amerillo etc.
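The query syntax above can be tried directly against the Xapian query parser. The following is a minimal sketch using the CPAN Search::Xapian bindings, not the plugin's actual code. The term prefix scheme ("title:" and so on), the AND default operator and the chosen parser flags are assumptions made for illustration.

```perl
use strict;
use warnings;
use Search::Xapian ':all';

# Empty in-memory database: only needed so the parser can expand wildcards.
my $db = Search::Xapian::WritableDatabase->new();

my $qp = Search::Xapian::QueryParser->new();
$qp->set_database($db);
$qp->set_stemmer( Search::Xapian::Stem->new('english') );
$qp->set_stemming_strategy(STEM_SOME);
$qp->set_default_op(OP_AND);    # ASSUMPTION: terms are ANDed by default

# ASSUMPTION: each field's terms were indexed with "<field>:" as term prefix.
for my $field (qw( documents title abstract creators_name date )) {
    $qp->add_prefix( $field, $field . ':' );
}

# Enable phrases ("..."), brackets, +/- terms and trailing '*' wildcards.
my $flags = FLAG_PHRASE | FLAG_BOOLEAN | FLAG_LOVEHATE | FLAG_WILDCARD;

for my $input (
    'title:(eagle buzzard) abstract:"london wetlands"',
    'eagle -buzzard',
    '"Southampton University"',
    'ameri*',
) {
    my $query = $qp->parse_query( $input, $flags );
    printf "%-50s => %s\n", $input, $query->get_description();
}
```

Printing the query description is a quick way to see how prefixes, negation and stemming were interpreted. With an empty database the wildcard expands to nothing; against a real index it expands to the matching terms.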
Xapian search results are returned in a sub-class of EPrints::List (a wrapper around a Xapian enquire object). Calling count() on the list (see EPrints::List) returns an estimate of the total matches.
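To illustrate where the estimated count comes from, here is a small sketch at the Xapian level, using the CPAN Search::Xapian bindings and an in-memory database. It does not use EPrints::List; the two sample titles are made up for the example.

```perl
use strict;
use warnings;
use Search::Xapian ':all';

# In-memory database holding two hypothetical records.
my $db = Search::Xapian::WritableDatabase->new();

my $tg = Search::Xapian::TermGenerator->new();
$tg->set_stemmer( Search::Xapian::Stem->new('english') );

for my $title ( 'Eagles of the London wetlands', 'Buzzard populations' ) {
    my $doc = Search::Xapian::Document->new();
    $doc->set_data($title);
    $tg->set_document($doc);
    $tg->index_text($title);
    $db->add_document($doc);
}

my $qp = Search::Xapian::QueryParser->new();
$qp->set_database($db);
$qp->set_stemmer( Search::Xapian::Stem->new('english') );
$qp->set_stemming_strategy(STEM_SOME);
my $query = $qp->parse_query( 'eagle -buzzard', FLAG_LOVEHATE );

# The enquire object runs the query; the match set carries the estimate.
my $enq  = $db->enquire($query);
my $mset = $enq->get_mset( 0, 10 );    # first page of up to 10 results

printf "estimated matches: %d (retrieved %d)\n",
    $mset->get_matches_estimated(), $mset->size();
```

Xapian only examines as many documents as it needs to fill the requested page, which is why the total is reported as an estimate rather than an exact count.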
As Xapian has a higher 'qs' score than Internal it will (once enabled) override the default EPrints simple search. You can override this behaviour in cfg.d/plugins.pl:
$c->{plugins}{'Search::Xapian'}{params}{qs} = .1;
Or disable completely (including disabling indexing):
$c->{plugins}{'Search::Xapian'}{params}{disable} = 1;
USAGE
Install the Search::Xapian extension (http://search.cpan.org/search?query=xapian&mode=dist). Note: there are two Perl bindings available for Xapian. The CPAN version is older and based on Perl-XS; xapian-bindings-perl, available from xapian.org, is based on SWIG and has better coverage of the API. Whichever you choose, the latest stable version of the Xapian library is highly recommended for the best feature support and performance.
Xapian uses a separate (from MySQL) index that is stored in archives/[archiveid]/var/xapian. To build the Xapian index you will need to reindex:
./bin/epadmin reindex [archiveid] eprint
(Repeat for any other datasets you expect to use Xapian with.)
The var/xapian/ directory should contain something like:
flintlock position.baseA position.DB postlist.baseB record.baseA record.DB termlist.baseB
iamchert position.baseB postlist.baseA postlist.DB record.baseB termlist.baseA termlist.DB
The indexing process for Xapian is defined in lib/cfg.d/search_xapian.pl. This can be overridden by dropping a file of the same name into your repository's archives/[archiveid]/cfg.d/ directory. If the Xapian search is not matching what you expect, you probably need to fix the indexing process (and re-index!). Terms indexed by Xapian can also be weighted, e.g. to give names a higher weighting than abstract text.
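As an illustration of term weighting, the sketch below uses the CPAN Search::Xapian term generator directly. It is not the content of search_xapian.pl: the sample record, the weights and the "<field>:" prefix scheme are all hypothetical.

```perl
use strict;
use warnings;
use Search::Xapian ':all';

my $tg = Search::Xapian::TermGenerator->new();
$tg->set_stemmer( Search::Xapian::Stem->new('english') );

my $doc = Search::Xapian::Document->new();
$tg->set_document($doc);

# Hypothetical record and per-field weights (names count 5x abstract text).
my %record = (
    creators_name => 'Smith, John',
    abstract      => 'A survey of eagles in the London wetlands.',
);
my %weight = ( creators_name => 5, abstract => 1 );

for my $field ( sort keys %record ) {
    # index_text( text, wdf_increment, prefix ):
    # once with the field prefix, for field-constrained searches ...
    $tg->index_text( $record{$field}, $weight{$field}, $field . ':' );
    # ... and once without, for searches over the whole index.
    $tg->index_text( $record{$field}, $weight{$field} );
}

# List the generated terms with their within-document frequency (wdf).
my $it  = $doc->termlist_begin();
my $end = $doc->termlist_end();
while ( $it != $end ) {
    printf "%-25s wdf=%d\n", $it->get_termname(), $it->get_wdf();
    ++$it;
}
```

A larger wdf increment makes a term contribute more to the relevance score, so a match on a name would rank above a match on the same word in the abstract.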
You will need to restart your Apache server to enable the Xapian search plugin and dependencies.
If the Xapian search is working correctly you will have a "by relevance" option available in the ordering of simple search results.
Lock Files
Xapian maintains a lock file in archives/[archiveid]/var/xapian. If you see indexing errors about not being able to lock the database, ensure you aren't running multiple copies of the EPrints indexer (http://wiki.eprints.org/w/API:bin/indexer). If no other processes are running, you may need to manually remove the lock file from the var/xapian directory. While only one process may modify the Xapian index at a time, any number of processes may read it concurrently.
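Before removing anything by hand, it can help to check whether the write lock is actually held. The following is a rough sketch using the CPAN Search::Xapian bindings; the helper name can_write_lock is made up for this example. It briefly opens the index for writing, so treat it as a diagnostic to run with the indexer stopped rather than a routine check.

```perl
use strict;
use warnings;
use Search::Xapian ':all';

# Hypothetical helper: returns 1 if a write lock can be obtained on the
# Xapian database at $path, 0 otherwise (the error text is left in $@).
sub can_write_lock {
    my ($path) = @_;
    my $db = eval { Search::Xapian::WritableDatabase->new( $path, DB_OPEN ) };
    return 0 if !$db;
    undef $db;    # destroying the handle releases the lock again
    return 1;
}

# Usage: perl check_lock.pl /path/to/archives/ARCHIVEID/var/xapian
if (@ARGV) {
    my $path = shift @ARGV;
    if ( can_write_lock($path) ) {
        print "No other process is holding the write lock on $path\n";
    }
    else {
        print "Could not open $path for writing: $@\n";
    }
}
```

If the open fails while no indexer is running, the error message printed here is usually more informative than the presence or absence of the lock file alone.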
PARAMETERS
- lang: Override the default language used for stemming.
- stopwords: An array reference of stop words to use (defaults to English).
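These parameters can presumably be set in cfg.d/plugins.pl in the same way as the qs and disable parameters shown in the description. The fragment below is a sketch under that assumption; the language code and the word list are example values only.

```perl
# archives/[archiveid]/cfg.d/plugins.pl
# ASSUMPTION: lang and stopwords are plugin params, set like qs/disable.

# Stem using French rather than the default language.
$c->{plugins}{'Search::Xapian'}{params}{lang} = 'fr';

# Replace the default (English) stop word list with a custom one.
$c->{plugins}{'Search::Xapian'}{params}{stopwords} =
    [ qw( le la les un une des ) ];
```

Changing the stemming language affects how terms are indexed as well as how queries are parsed, so a reindex is likely to be needed afterwards.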
METHODS
stemmer
$stemmer = $plugin->stemmer()
Returns a Search::Xapian::Stem for the default language.
stopper
$stopper = $plugin->stopper()
Returns a Search::Xapian::SimpleStopper for stopwords.
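The two object types returned by these methods can also be created and exercised on their own. The sketch below uses the CPAN Search::Xapian bindings directly, without an EPrints plugin object; the language and the stop word list are example values.

```perl
use strict;
use warnings;
use Search::Xapian;

# Stand-alone equivalents of what stemmer() and stopper() return.
my $stemmer = Search::Xapian::Stem->new('english');
my $stopper = Search::Xapian::SimpleStopper->new( qw( the of and ) );

for my $word (qw( the bubbles of southampton )) {
    if ( $stopper->stop_word($word) ) {
        print "$word: stop word, ignored\n";
        next;
    }
    printf "%s: stemmed to '%s'\n", $word, $stemmer->stem_word($word);
}
```

In a plugin context the same calls would be made on the objects returned by $plugin->stemmer() and $plugin->stopper().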
COPYRIGHT
Copyright 2000-2011 University of Southampton.
This file is part of EPrints http://www.eprints.org/.
EPrints is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
EPrints is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with EPrints. If not, see http://www.gnu.org/licenses/.