Difference between revisions of "Using Issues for Quality Control"

From EPrints Documentation
Revision as of 15:06, 23 November 2009

EPrints 3.1 introduces a new feature to assist with the quality control process.

EPrints can automatically identify potential issues with an item and bring them to the attention of the repository editor or administrator.

This page describes how issues are shown in the Web UI and how the process of identifying issues works behind the scenes.

Working with Issues

Issues with individual items

EPrints reports any potential issues with an individual item on the item control page. This can be useful for:

  • a depositor editing an item before submitting to review
  • an editor reviewing an item before making it live
  • an editor or administrator viewing/editing an item in the live archive

The number of issues identified appears on the Details tab:

Issues count.png

A new "Issues" tab is also available:

Issues tab.png

Selecting this tab will list the full details of the issues identified with the item:

Issues details.png

The depositor, editor or administrator can then use the editing options to correct the issue (see Reference below).

Issues Tab in Detail

The issues tab has two sections:

  • Current Issues
  • Recorded Issues

The "Current Issues" section lists the "live" issues with the item; "Recorded Issues" lists the issues recorded by EPrints the last time the issue detection script was run.

Technical Note: Issues come in two flavours: XML-based issues and plugin-based issues. The Reference section below lists the issues that EPrints recognises by default and indicates whether each issue is XML-based or plugin-based. Only XML-based issues will appear in the "Current Issues" section. Both XML-based and plugin-based issues will appear in the "Recorded Issues" section.

In the example below, EPrints recorded 3 issues when the issue detection script ran at 11:46AM - these issues are listed in the "Recorded Issues" section. The editor has subsequently resolved the "Missing or one character family name" issue so this is no longer shown in the "Current Issues" section. Note that the "Duplicate title" issue only appears in the "Recorded Issues" section because it is a plugin-based issue (see Reference section below).

Issues sections.png

Searching for Issues

Some issues may be time dependent, so will not be identified at the point of deposit or during the editorial review process. For example, an item that is listed as "in press" (i.e. due to be published) in 2007 may need attention if it is still in press 2 years later. By that time the item will probably have been published and there may be extra metadata available (such as page numbers) which can be added to the item record.

Also, repository administrators upgrading to EPrints 3.1 may want to revisit their existing records and use the new system to identify any potential problems.

EPrints provides an Issues Search function that is available to editors and administrators via the Admin screen:

Issues search button.png

The Issues Search allows the repository to be searched using a number of criteria, including issue type (see Reference below), item publication date, type and subject headings. It is also possible to limit the search to one of the 4 areas of the repository - here the repository administrator searches for all issues related to items in the live area of the repository:

Issues search form.png

The search results are presented as a list of items, with each item's issues summarised below:

Issues search results.png

Issues Reference (EPrints 3.1)

| Issue | Type | Reported When | Suggested Action |
|---|---|---|---|
| Old but not published | XML | Publication Status is "Submitted" or "In Press" and Date is more than 2 years ago | If the item was published, change its Publication Status to "Published" and enter/verify any additional metadata related to its publication (for example, page numbers, volume, issue, publication date). If the item was not published, change its Publication Status to "Unpublished". |
| Short family name | XML | An author's family (sur)name is missing or very short (1 character) | This issue is intended to identify author names that have been entered back-to-front, for example the author's initial is entered in the family name field and surname in the given name field. Verify that the author name is correct, edit if necessary. |
| Duplicate title | Plugin | The title of the item is exactly the same as another item | Verify that the items do not describe the same work. If the item is a duplicate, move it to the Retired area of the repository or delete it outright. |
| Similar title | Plugin | The title of the item is similar to another item | Verify that the items do not describe the same work. If the item is a duplicate, move it to the Retired area of the repository or delete it outright. |

Behind the scenes

Issues can be defined in two ways: issues that relate to the properties of an individual item are defined using an XML syntax; issues that require a larger scope (such as comparing the item to other items in the repository) are implemented as plugins.

The "Issues" tab on the details page has 2 sections (see "Issues Tab in Detail" above):

  • Current Issues
  • Recorded Issues

Issues defined in the XML syntax are displayed in the "Current Issues" section immediately.

A command line script called "issues_audit" must be run to process the plugin-based issues. Any issues it identifies are recorded and will then appear in the "Recorded Issues" section. Note that a special plugin applies the XML-defined issues to each item (see Plugin Reference below), so XML-defined issues can appear in both the "Current Issues" section and the "Recorded Issues" section. This script must also be run in order for the "Issues Search" function to work.

So the "Current Issues" section reflects the "live" issues for the item, whereas the "Recorded Issues" section lists the issues that were reported the last time the issues_audit script was run.

issues.xml

The issues.xml file (archives/ARCHIVEID/cfg/issues.xml) uses the EPrints Control Format and EPScript to define a series of tests that are carried out on an item's metadata.

Each issue is then defined using an "<issue>..</issue>" construct.

Example 1 In this case, the metadata fields date and ispublished are tested. If the date is more than 2 years in the past and ispublished has the value "submitted" or "inpress", the "old_but_not_published" issue is reported for the item.

  <epc:if test="date.is_set() and date lt today().datemath( -2,'year' ) and ( ispublished = 'submitted' or ispublished = 'inpress' )">
     <issue type="old_but_not_published">Date is <epc:print expr="date" /> but item is still marked as <epc:print expr="ispublished" />.</issue>
  </epc:if>

Example 2 An <issue> without a conditional test will always be reported.

<issue type="always_reported">This issue is always reported.</issue>

Always reported.png

Example 3 Identify published articles that do not have an ISSN.

  <epc:if test="type='article' and ispublished='pub' and !issn.is_set()">
    <issue type="published_but_no_issn">Item is published but has no ISSN. </issue>
  </epc:if>

Issues plugins

The default EPrints Issues plugins are stored in perl_lib/EPrints/Plugin/Issues/.

Local repository plugins should be stored in archives/ARCHIVEID/cfg/plugins/EPrints/Plugin/Issues/.

Plugins can implement some or all of the following methods:

  • process_item_in_list - record issues/details for whole list
  • item_issues - return a list of issues for the given item
  • process_at_end - add any additional issues

Each plugin receives a list of EPrints to test. If the plugin has implemented process_item_in_list, this will be called for each item in the list; otherwise, item_issues will be called for each item in the list. process_at_end will be called after the whole list has been processed.
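The call order described above can be sketched in plain Perl. This is only an illustration, not the actual issues_audit code: run_plugin, the Demo::NoTitle package, the "no_title" issue type, the argument lists and the use of plain hashes for items are all assumptions made so that the example runs without EPrints.

```perl
use strict;
use warnings;

# Hypothetical driver showing the dispatch order only.
# $info is a shared scratch hash, as in the plugin excerpts on this page.
sub run_plugin
{
        my( $plugin, @items ) = @_;

        my $info = { issues => {} };
        foreach my $item ( @items )
        {
                if( $plugin->can( "process_item_in_list" ) )
                {
                        # the plugin records whatever it needs in $info
                        $plugin->process_item_in_list( $item, $info );
                }
                elsif( $plugin->can( "item_issues" ) )
                {
                        # per-item plugins simply return a list of issues
                        my @found = $plugin->item_issues( $item );
                        push @{$info->{issues}->{$item->{id}}}, @found if @found;
                }
        }
        # called once, after the whole list has been processed
        $plugin->process_at_end( $info ) if $plugin->can( "process_at_end" );

        return $info->{issues};
}

package Demo::NoTitle;   # hypothetical plugin that only implements item_issues

sub new { return bless {}, shift }

sub item_issues
{
        my( $plugin, $item ) = @_;

        return () if defined $item->{title} && length $item->{title};
        return {
                type => "no_title",
                id => "no_title_" . $item->{id},
                description => "Item has no title",
        };
}

package main;

my $issues = run_plugin(
        Demo::NoTitle->new,
        { id => 1, title => "A" },
        { id => 2, title => "" },
);
print join( ",", sort keys %$issues ), "\n";   # prints "2"
```

A plugin that needs to compare items with each other would instead implement process_item_in_list and process_at_end, as the ExactTitleDups excerpt does.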

Here's an excerpt from the default ExactTitleDups plugin:

sub process_item_in_list
{
       ...
       # remember which item ids share each title
       my $title = $item->get_value( "title" );
       push @{$info->{titlemap}->{$title}}, $item->get_id;
}

sub process_at_end
{
       ...
       foreach my $code ( keys %{$info->{titlemap}} )
       {
               my @set = @{$info->{titlemap}->{$code}};
               # a title used by only one item is not a duplicate
               next unless scalar @set > 1;
               foreach my $id ( @set )
               {
                               ...
                               # $id2 and $desc are set in the elided code
                               push @{$info->{issues}->{$id2}}, {
                                       type => "duplicate_title",
                                       id => "duplicate_title_$id",
                                       description => $desc,
                               };
                               ...
               }
       }
}

This plugin uses process_item_in_list to keep a record of all the titles it encounters. process_at_end then checks for duplicate titles and creates an issue for each matching item.
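The grouping technique can be tried outside EPrints with a simplified, self-contained sketch. It is not the ExactTitleDups source: duplicate_title_issues, the plain-hash items and the wording of the description are assumptions made for illustration, and the two plugin phases are folded into one function.

```perl
use strict;
use warnings;

# Group item ids by title, then report every id whose title is shared.
# Items are plain hashes here; a real plugin would receive EPrints objects.
sub duplicate_title_issues
{
        my( @items ) = @_;

        # phase 1 (process_item_in_list in the plugin): build the title map
        my %titlemap;
        foreach my $item ( @items )
        {
                next unless defined $item->{title};
                push @{$titlemap{$item->{title}}}, $item->{id};
        }

        # phase 2 (process_at_end in the plugin): flag titles used more than once
        my %issues;
        foreach my $title ( keys %titlemap )
        {
                my @set = @{$titlemap{$title}};
                next unless scalar @set > 1;
                foreach my $id ( @set )
                {
                        my @others = grep { $_ != $id } @set;
                        push @{$issues{$id}}, {
                                type => "duplicate_title",
                                id => "duplicate_title_$id",
                                description => "Same title as item(s) " . join( ", ", @others ),
                        };
                }
        }
        return \%issues;
}

my $issues = duplicate_title_issues(
        { id => 1, title => "On Widgets" },
        { id => 2, title => "On Gadgets" },
        { id => 3, title => "On Widgets" },
);
print "$_: $issues->{$_}[0]{description}\n" for sort { $a <=> $b } keys %$issues;
```

Items 1 and 3 are each reported with a reference to the other, and item 2 is not reported. The comparison is exact, so titles that differ only in case or spacing are treated as different.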

Example 1

sub item_issues
{
       my( $plugin, $dataobj ) = @_;
 
       my @issues;
       foreach my $doc ( $dataobj->get_all_documents )
       {
               my $format = $doc->get_value( "format" );
               my %files = $doc->files;
               if( $format eq "application/pdf" && scalar keys %files > 1 )
               {
                       push @issues, {
                               id => "extra_files_" . $dataobj->get_id,
                               type => "extra_files",
                               description => "Extra files uploaded to PDF document",
                       };
               }
       }
       return @issues;
}

This plugin implements the item_issues method to check for PDF documents where the depositor has accidentally uploaded more than 1 file.

Issues extra.png

Plugin Reference (EPrints 3.1)

| Name | Description |
|---|---|
| ExactTitleDups.pm | Finds items with identical titles, records an issue on each matching item |
| SimilarTitles.pm | Finds items with similar titles, records an issue on each matching item |
| XMLConfig.pm | Applies the rules in issues.xml to each item, records each issue |

Note all default plugins are located in perl_lib/EPrints/Plugin/Issues/

Checklist for Repository Administrators

  1. Add any additional item level issues to issues.xml
  2. Develop plugins for other issues as required
  3. Schedule issues_audit script to run regularly using cron