Difference between revisions of "Using Issues for Quality Control"
Latest revision as of 12:53, 5 March 2022
EPrints 3.1 introduced a new tool to assist with the quality control process: EPrints can automatically identify potential issues with items in the repository, and bring them to the attention of the repository editor or administrator. This page describes how to use the new tool and provides a behind-the-scenes look at how it works, including configuration and code examples.
Working with Issues
Issues with individual items
EPrints reports any issues with an individual item on the item control page. This can be useful for:
- a depositor editing an item before submitting to review
- an editor reviewing an item before making it live
- an editor or administrator viewing/editing an item in the live archive
The number of issues identified appears on the Details tab:
A new "Issues" tab is also available:
Selecting this tab will list the full details of the issues identified with the item:
As a depositor, editor or administrator you can then use the normal EPrints editing options to correct the issue - see Issues Reference below for guidelines.
Issues Tab in Detail
The issues tab actually has two sections:
- Current Issues
- Recorded Issues
The "Current Issues" section lists the "live" issues with the item; "Recorded Issues" lists the issues recorded by EPrints the last time the "issue discovery" process was run (typically this should be scheduled to run nightly - see "Checklist" below).
Technical Note: Issues come in two flavours - XML-based issues and plugin-based issues. The "Issues Reference" section below lists the issues that EPrints will recognise by default and indicates whether each issue is XML-based or plugin-based. Only XML-based issues will appear in the "Current Issues" section. Both XML-based and plugin-based issues will appear in the "Recorded Issues" section.
In the example below, EPrints found 3 issues when the issue discovery process ran at 11:46AM - these issues are listed in the "Recorded Issues" section. The editor has subsequently resolved the "Missing or one character family name" issue (#2) so this is no longer shown in the "Current Issues" section. Note that the "Duplicate title" issue only appears in the "Recorded Issues" section because it is a plugin-based issue (see Reference section below). The next time the issue discovery process runs, the "Missing or one character family name" issue will no longer be detected, and no longer be displayed in the "Recorded Issues" section.
Searching for Issues
Some issues may be time dependent, so will not be identified at the point of deposit or during the editorial review process. For example, an item that is deposited as "In Press" (i.e. due to be published) in 2007 may need attention if it is still "In Press" 2 years later - by that time the item will probably have been published and there may be extra metadata available (such as page numbers) which could be added to the item record.
Furthermore, repository administrators upgrading to EPrints 3.1 may want to revisit their existing records and use the new system to identify any potential problems.
EPrints provides an Issues Search function that is available to Repository Administrators via the Admin screen:
The Issues Search allows the repository to be searched using a number of different criteria, including issue type (see Reference below), item publication date, type and subject headings. It is also possible to limit the search to one of the 4 areas of the repository - in the example below the Repository Administrator searches for all issues related to items in the live area of the repository:
The search results are presented as a list of items, with each item's issues summarised below:
Issues Reference
This table lists the issues that EPrints will automatically recognise:
| Issue | EPrints Version | Type | Reported When | Suggested Action |
|---|---|---|---|---|
| Old but not published | 3.1 | XML | Publication Status is "Submitted" or "In Press" and Date is more than 2 years ago | If the item was published, change its Publication Status to "Published" and enter/verify any additional metadata related to its publication (for example, page numbers, volume, issue, publication date). If the item was not published, change its Publication Status to "Unpublished". | 
| Short family name | 3.1 | XML | An author's family (sur)name is missing or very short (1 character) | This issue is intended to identify author names that have been entered back-to-front, for example the author's initial is entered in the family name field and surname in the given name field. Verify that the author name is correct, edit if necessary. | 
| Duplicate title | 3.1 | Plugin | The title of the item is exactly the same as another item | Verify that the items do not describe the same work. If the item is a duplicate, move it to the Retired area of the repository or delete it outright. | 
| Similar title | 3.1 | Plugin | The title of the item is similar to another item | Verify that the items do not describe the same work. If the item is a duplicate, move it to the Retired area of the repository or delete it outright. | 
Behind the scenes
Issues can be defined in two ways: issues that relate to the properties of an individual item are defined in a configuration file using an XML syntax; issues that require a larger scope (such as comparing the item to other items in the repository) are implemented as EPrints plugins.
The "Issues" tab on the details page has 2 sections (see "Issues Tab in Detail" above):
- Current Issues
- Recorded Issues
Any XML-defined issues are displayed in the "Current Issues" section immediately. Plugin-based issues are only processed when the command line script "bin/issues_audit" is run; any issues it identifies are recorded and then appear in the "Recorded Issues" section. Note that a special plugin checks each of the XML-defined issues (see Plugin Reference below), so XML-defined issues can appear in both the "Current Issues" and "Recorded Issues" sections. Note also that this script must be run regularly in order for the "Issues Search" to function correctly. In short, the "Current Issues" section reflects the "live" issues for the item, whereas the "Recorded Issues" section lists the issues that were reported the last time the issues_audit script was run.
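The relationship between the two sections can be modelled with a small standalone Perl sketch. It does not use the EPrints API: the helper name resolved_since_audit, the sample data and the issue type names short_family_name and similar_title are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of EPrints): given the issue types recorded
# at the last audit and the issue types detected right now, return the
# recorded ones that have since been resolved. Plugin-based issues never
# appear in the current list, so they are skipped rather than treated as
# resolved.
sub resolved_since_audit
{
    my( $recorded, $current, $plugin_based ) = @_;

    my %is_current = map { $_ => 1 } @{$current};
    my %is_plugin  = map { $_ => 1 } @{$plugin_based};

    return grep { !$is_current{$_} && !$is_plugin{$_} } @{$recorded};
}

# Made-up sample data for one item.
my @recorded = ( "old_but_not_published", "short_family_name", "duplicate_title" );
my @current  = ( "old_but_not_published" );
my @plugin   = ( "duplicate_title", "similar_title" );

my @resolved = resolved_since_audit( \@recorded, \@current, \@plugin );
print join( ", ", @resolved ), "\n";
```

In this sample the family name issue has been fixed since the last audit, so it is the only one reported as resolved; the duplicate title issue stays recorded until the audit script runs again.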
issues.xml
Issues that relate to the properties of an individual item are defined in the configuration file archives/ARCHIVEID/cfg/issues.xml.
The issues.xml file uses the EPrints Control Syntax and EPrints Script to define a series of tests that can be carried out on an item's metadata to work out which issues an item has. Each issue is then defined using an "<issue>..</issue>" construct.
Example 1 Test the metadata fields date and ispublished - if the date is more than 2 years in the past and ispublished has the value "submitted" or "inpress", report the "old_but_not_published" issue for the item (this example is taken from the default issues.xml in EPrints 3.1.3).
  <epc:if test="date.is_set() and date lt today().datemath( -2,'year' ) and ( ispublished = 'submitted' or ispublished = 'inpress' )">
     <issue type="old_but_not_published">Date is <strong><epc:print expr="date" /></strong> but item is still marked as <strong><epc:print expr="ispublished" /></strong>.</issue>
  </epc:if>
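For readers more familiar with Perl than EPrints Script, the same test can be sketched as a standalone function. This is only an illustration of the logic, not how EPrints evaluates issues.xml: the function name, the explicit "today" argument and the use of plain string comparison on ISO-style dates are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical standalone equivalent of the epc:if test above.
# Dates are assumed to be ISO-style strings (YYYY-MM-DD), which sort
# correctly under plain string comparison.
sub old_but_not_published
{
    my( $date, $ispublished, $today ) = @_;

    # date.is_set()
    return 0 unless defined $date && length $date;

    # today().datemath( -2, 'year' ): subtract two from the year part.
    my( $year, $rest ) = $today =~ /^(\d{4})(.*)$/;
    my $cutoff = sprintf( "%04d%s", $year - 2, $rest );

    # date lt cutoff
    return 0 unless $date lt $cutoff;

    # ispublished = 'submitted' or ispublished = 'inpress'
    my $status = defined $ispublished ? $ispublished : "";
    return ( $status eq "submitted" || $status eq "inpress" ) ? 1 : 0;
}

print old_but_not_published( "2007-06-01", "inpress", "2010-01-15" ), "\n";
print old_but_not_published( "2009-06-01", "inpress", "2010-01-15" ), "\n";
print old_but_not_published( "2007-06-01", "pub",     "2010-01-15" ), "\n";
```

Only the first call reports the issue: the second item is less than 2 years old and the third is already marked as published.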
Example 2 An <issue> without a conditional test will always be reported.
<issue type="always_reported">This issue is always reported.</issue>
Example 3 Report published articles that do not have an ISSN.
  <epc:if test="type='article' and ispublished='pub' and !issn.is_set()">
    <issue type="published_but_no_issn">Item is published but has no ISSN. </issue>
  </epc:if>
Issues plugins
The default EPrints Issues plugins can be found in perl_lib/EPrints/Plugin/Issues/. Local repository plugins should be stored in archives/ARCHIVEID/cfg/plugins/EPrints/Plugin/Issues/.
Each plugin receives a list of EPrints to test. Plugins can implement some or all of the following methods:
- process_item_in_list - called for each item in the list; records issues or any details needed to assess the list as a whole
- item_issues - return a list of issues for the given item
- process_at_end - add any additional issues after the list has been processed
If a plugin implements process_item_in_list, this will be called for each item in the list; otherwise, item_issues will be called for each item in the list. process_at_end will be called after the whole list has been processed.
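The call order described above can be sketched in standalone Perl. This is not the EPrints implementation: the run_plugin driver, the ToyPlugin class, the argument order of the methods and the hash-based items are all hypothetical and exist only to show which method is called when.

```perl
use strict;
use warnings;

# Hypothetical driver, NOT the EPrints code: models the call order for a
# single plugin and a list of items (plain hashes here).
sub run_plugin
{
    my( $plugin, $items ) = @_;

    my $info = { issues => {} };
    foreach my $item ( @{$items} )
    {
        if( $plugin->can( "process_item_in_list" ) )
        {
            # The plugin records what it needs in $info itself.
            $plugin->process_item_in_list( $item, $info );
        }
        else
        {
            # Fall back to item_issues and store whatever it returns.
            my @found = $plugin->item_issues( $item );
            push @{$info->{issues}->{$item->{id}}}, @found if @found;
        }
    }

    # Called once, after every item has been seen.
    $plugin->process_at_end( $info ) if $plugin->can( "process_at_end" );

    return $info->{issues};
}

# Toy plugin that only implements item_issues.
package ToyPlugin;

sub new { return bless {}, shift; }

sub item_issues
{
    my( $plugin, $item ) = @_;

    return () if defined $item->{title} && length $item->{title};
    return {
        type => "no_title",
        id => "no_title_" . $item->{id},
        description => "Item has no title",
    };
}

package main;

my $issues = run_plugin(
    ToyPlugin->new,
    [ { id => 1, title => "A paper" }, { id => 2, title => "" } ],
);
foreach my $id ( sort keys %{$issues} )
{
    print $id . ": " . $issues->{$id}->[0]->{type} . "\n";
}
```

Because ToyPlugin has no process_item_in_list method, the driver falls back to item_issues for every item, and only the untitled item ends up with an issue.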
Example 1 Use process_item_in_list to keep a record of all the titles encountered; check for duplicate titles in process_at_end and create an issue for each matching item (code excerpts from the EPrints 3.1.3 default ExactTitleDups.pm plugin):
sub process_item_in_list
{
       ...
       my $title = $item->get_value( "title" );
       push @{$info->{titlemap}->{$title}}, $item->get_id;
}
sub process_at_end
{
       ...
       foreach my $code ( keys %{$info->{titlemap}} )
       {
               my @set = @{$info->{titlemap}->{$code}};
               next unless scalar @set > 1;
               foreach my $id ( @set )
               {
                               ...
                               push @{$info->{issues}->{$id2}}, {
                                       type => "duplicate_title",
                                       id => "duplicate_title_$id",
                                       description => $desc,
                               };
                               ...
               }
       }
 }
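Since the excerpt above elides several lines, the underlying two-pass technique may be easier to follow in a self-contained form. The sketch below is not the ExactTitleDups.pm code: it uses plain hashes instead of EPrints objects, and the function name find_duplicate_titles, the sample items and the description text are assumptions.

```perl
use strict;
use warnings;

# Standalone illustration of the title-map technique; no EPrints classes
# are used and the items are made-up sample data.
sub find_duplicate_titles
{
    my( @items ) = @_;

    # Pass 1: map each title to the ids of the items that carry it.
    my %titlemap;
    foreach my $item ( @items )
    {
        push @{$titlemap{$item->{title}}}, $item->{id};
    }

    # Pass 2: every title shared by more than one item yields an issue
    # on each of those items, naming the other items with the same title.
    my %issues;
    foreach my $title ( keys %titlemap )
    {
        my @set = @{$titlemap{$title}};
        next unless scalar @set > 1;
        foreach my $id ( @set )
        {
            my @others = grep { $_ != $id } @set;
            push @{$issues{$id}}, {
                type => "duplicate_title",
                description => "Duplicate title to item(s) " . join( ", ", @others ),
            };
        }
    }
    return \%issues;
}

my $issues = find_duplicate_titles(
    { id => 10, title => "On Repositories" },
    { id => 11, title => "Another Title" },
    { id => 12, title => "On Repositories" },
);
foreach my $id ( sort { $a <=> $b } keys %{$issues} )
{
    print $id . ": " . $issues->{$id}->[0]->{description} . "\n";
}
```

Collecting all titles first and resolving duplicates at the end means each item is visited once, rather than being compared against every other item in the list.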
Example 2 Implement the item_issues method to check for PDF documents where the depositor has accidentally uploaded more than 1 file.
sub item_issues
{
       my( $plugin, $dataobj ) = @_;
 
       # Collect one issue for each PDF document that holds more than one file
       my @issues;
       foreach my $doc ( $dataobj->get_all_documents )
       {
               my $format = $doc->get_value( "format" );
               # One hash entry per file attached to the document
               my %files = $doc->files;
               if( $format eq "application/pdf" && scalar keys %files > 1 )
               {
                       push @issues, {
                               id => "extra_files_" . $dataobj->get_id,
                               type => "extra_files",
                               description => "Extra files uploaded to PDF document",
                       };
               }
       }
       # An empty list means no issues were found for this item
       return @issues;
}
Plugin Reference
This table lists the default EPrints Issue plugins:
| Name | EPrints Version | Description | 
|---|---|---|
| ExactTitleDups.pm | 3.1.3 | Finds items with identical titles, records issue on each matching item | 
| SimilarTitles.pm | 3.1.3 | Finds items with similar titles, records issue on each matching item | 
| XMLConfig.pm | 3.1.3 | Applies rules in issues.xml to each item, records each issue | 
All default plugins are located in perl_lib/EPrints/Plugin/Issues/.
Checklist for Repository Administrators
- Define any additional item-level issues in issues.xml
- Develop plugins for other issues as required
- Schedule the issues_audit script to run regularly using cron









