Contribute: Plugins/ImportPluginsCSV

From EPrints Documentation
Revision as of 19:07, 28 September 2007 by Tom

Import Plugin Tutorial 1: CSV

In this tutorial we will look at creating a relatively simple plugin to import eprints into our repository by reading files containing comma-separated values. We won't be dealing with documents and files, but will be focusing on importing eprint metadata.

Import plugins are inherently more complicated than export plugins because of the error checking that must be done. In this example, however, error checking has been kept to a minimum for simplicity. In a "real" plugin you should check that the appropriate metadata fields are set for a given type of eprint; unfortunately, there appears to be no quick way to do this.

Before You Start

Create a directory for your import plugins in the main import plugin directory (usually /opt/eprints3/perl_lib/EPrints/Plugin/Import, matching the package name used below). The directory used for these examples is called "MyPlugins".

To prepare for this tutorial you should install the Text::CSV module. Running the following command as root, or with sudo, should work.

cpan Text::CSV

CSV.pm

The code in the section below should be placed in a file called CSV.pm in the directory created previously, and MyPlugins should be changed to the name of that directory.

package EPrints::Plugin::Import::MyPlugins::CSV;

use EPrints::Plugin::Import::TextFile;
use strict;

our @ISA = ('EPrints::Plugin::Import::TextFile');

sub new
{
        my( $class, %params ) = @_;

        my $self = $class->SUPER::new( %params );

        $self->{name} = 'CSV';
        $self->{visible} = 'all';
        $self->{produce} = [ 'list/eprint' ];

        my $rc = EPrints::Utils::require_if_exists('Text::CSV');
        unless( $rc )
        {
                $self->{visible} = '';
                $self->{error} = 'Failed to load required module Text::CSV';
        }

        return $self;
}

sub input_fh
{
        my( $plugin, %opts ) = @_;
        my @ids;
        my $fh = $opts{fh};
        my @records = <$fh>;
        my $csv = Text::CSV->new();
        my @fields;

        if ($csv->parse(shift @records))
        {
                @fields = $csv->fields();
        }
        else
        {
                $plugin->error($csv->error_input);
                return undef;
        }

        foreach my $row (@records)
        {
                my @input_data = (join(',',@fields),$row);

                my $epdata = $plugin->convert_input(\@input_data);
                next unless defined $epdata;

                my $dataobj = $plugin->epdata_to_dataobj($opts{dataset},$epdata);
                if( defined $dataobj )
                {
                        push @ids, $dataobj->get_id;
                }
        }

        return EPrints::List->new(
                        dataset => $opts{dataset},
                        session => $plugin->{session},
                        ids=>\@ids );
}

sub convert_input
{
        my $plugin = shift;
        my @input = @{shift @_};
        my $csv = Text::CSV->new();

        my @record;
        if ($csv->parse($input[1]))
        {
                @record = $csv->fields();
        }
        else
        {
                $plugin->error($csv->error_input);
                return undef;
        }

        my @fields = split(',',$input[0]);

        if (scalar @fields != scalar @record)
        {
                $plugin->warning('Row length mismatch');
                return undef;
        }

        my %output = ();

        my $dataset = $plugin->{session}->{repository}->get_dataset('archive');

        my $i = 0;
        foreach my $field (@fields)
        {
                unless ($dataset->has_field($field))
                {
                        $i++;
                        next;
                }

                my $metafield = $dataset->get_field($field);

                if ($metafield->get_property('multiple'))
                {
                        my @values = split(';',$record[$i]);

                        if ($metafield->{type} eq 'name')
                        {
                                my @names = ();

                                foreach my $value (@values)
                                {
                                        my $name = $value;

                                        next unless ($value =~ /^(.*?),(.*?)(,(.*?))?$/);
                                        push @names, {family => $1,given => $2,lineage => $4};
                                }

                                $output{$field} = \@names;
                        }
                        else
                        {
                                $output{$field} = \@values;
                        }
                }
                else
                {
                        $output{$field} = $record[$i];
                }
                $i++;
        }
        return \%output;
}

1;

In More Detail

Modules

Here we import the superclass for our plugin.

use EPrints::Plugin::Import::TextFile;

Inheritance

Our plugin will not inherit from the Import class directly, but from the TextFile subclass. This contains some extra file handling code that means we can ignore certain differences in text file formats. If you are creating an import plugin which imports non-text files you should subclass the EPrints::Plugin::Import class directly.

our @ISA = ('EPrints::Plugin::Import::TextFile');

Constructor

For import plugins we must set a 'produce' field, to tell the repository what kinds of objects the plugin can import. This plugin only supports importing lists of eprints, but if it supported importing individual eprints we could add 'dataobj/eprint' to this property. We would then have to implement the "input_dataobj" method. Most plugins implement this method, but it is rarely used in practice. Most imports are done in lists (even if that list only contains one member), via the import items screen.

        $self->{produce} = [ 'list/eprint' ];

Here we use a module that is not included with EPrints, Text::CSV, so we import it in a different way. First we use "EPrints::Utils::require_if_exists" to check that it is installed and, if it is, to load it. If it isn't installed we make the plugin invisible and set an error message. It is good practice to import non-standard modules in this way rather than with "use".

        my $rc = EPrints::Utils::require_if_exists('Text::CSV');
        unless( $rc )
        {
                $self->{visible} = '';
                $self->{error} = 'Failed to load required module Text::CSV';
        }
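
The idea behind this check can be sketched in plain Perl. The helper name load_if_available below is hypothetical and is not part of EPrints; it only illustrates trying to load a module at run time and reporting whether that worked, which is an assumption about the general approach rather than a copy of what require_if_exists does internally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of EPrints): try to load a module at run
# time and return 1 on success, 0 if it cannot be loaded.
sub load_if_available
{
        my( $module ) = @_;

        # Turn Some::Module into Some/Module.pm, the form require expects
        # when it is given a string rather than a bareword.
        ( my $file = "$module.pm" ) =~ s{::}{/}g;

        return eval { require $file; 1 } ? 1 : 0;
}

# List::Util ships with Perl, so this should succeed.
print "List::Util available\n" if load_if_available( 'List::Util' );

# This module name is made up, so the load should fail quietly.
print "No::Such::Module::Here missing\n"
        unless load_if_available( 'No::Such::Module::Here' );
```

Because the failure is caught rather than fatal, the plugin can still be loaded and can simply hide itself, which is what the constructor above does when Text::CSV is missing.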

Input

Import plugins have to implement a couple of methods to read data from a file or string, manipulate it and turn it into a form which can be imported into the repository. That process will be described below.

input_fh

This method takes a filehandle, reads and processes its contents, tries to import DataObjs into the repository, and then returns a List of the DataObjs imported.

This array will be used to create a List of DataObjs later.

        my @ids;

Here we take the filehandle that was passed in (it is already open) and read its lines into an array.

        my $fh = $opts{fh};
        my @records = <$fh>;

We create a Text::CSV object to handle the input. Using a dedicated CSV handling package is preferable to using Perl's split function, as it copes with more complicated cases such as commas inside values that are enclosed in double quotes.

        my $csv = Text::CSV->new();
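
To see why split is not enough, here is a small stand-alone sketch (not part of the plugin) that runs both approaches on a line containing quoted commas. Text::CSV is loaded at run time here so that the script still runs if the module is not installed.

```perl
use strict;
use warnings;

my $line = q{An interesting article,"Damn it Jim, I'm a Doctor, not a Perl hacker."};

# Naive approach: split on every comma, including those inside the quotes.
my @naive = split( /,/, $line );
print "split found ", scalar @naive, " pieces\n";

# Text::CSV is loaded at run time so this sketch still runs without it.
if( eval { require Text::CSV; 1 } )
{
        my $csv = Text::CSV->new();
        if( $csv->parse( $line ) )
        {
                my @fields = $csv->fields();
                print "Text::CSV found ", scalar @fields, " fields\n";
        }
}
else
{
        print "Text::CSV is not installed\n";
}
```

Here split breaks the line into four pieces, because it also cuts at the two commas inside the quoted abstract, whereas Text::CSV should treat the quoted text as a single value.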

After setting up an array for the metadata field names, we attempt to parse the first line of our file. The parse method does not return an array of fields; it reports success or failure. On success we use the fields method to retrieve the fields just parsed. On failure we use the error_input method to retrieve the input that could not be parsed, report it as an error, and return undef.

        my @fields;
        if ($csv->parse(shift @records))
        {
                @fields = $csv->fields();
        }
        else
        {
                $plugin->error($csv->error_input);
                return undef;
        }

Now that the row of column titles has been dealt with we move onto processing each record in the file.

In import plugins the convert_input method converts an individual record into a format that can be imported into the repository: a hash whose keys are metadata field names and whose values are the corresponding values. A row cannot be imported on its own, because we don't know which field each value belongs to, so we first construct an array to pass to convert_input. Its first element is the row of field names and its second element is the row we want to import.

        foreach my $row (@records)
        {
                my @input_data = (join(',',@fields),$row);

Here we call convert_input on our constructed input_data. If the conversion fails we simply move to the next record.

                my $epdata = $plugin->convert_input(\@input_data);
                next unless defined $epdata;

The epdata_to_dataobj method takes our epdata hash reference and turns it into a new DataObj in our repository. If it is successful it returns the new DataObj, whose id we add to our array of ids.

                my $dataobj = $plugin->epdata_to_dataobj($opts{dataset},$epdata);
                if( defined $dataobj )
                {
                        push @ids, $dataobj->get_id;
                }

Finally we return a List object containing the ids of the records we have successfully imported.

        return EPrints::List->new(
                        dataset => $opts{dataset},
                        session => $plugin->{session},
                        ids=>\@ids );

convert_input

This method takes data in a particular format, in this case CSV, and transforms it into a hash of metadata field names and values.

We take the second argument to the method and convert the array reference into an array.

        my @input = @{shift @_};

Here we set up another Text::CSV object.

        my $csv = Text::CSV->new();

We take the second element of our array and parse it. This is the record we wish to import. If anything goes wrong we return undef.

        my @record;
        if ($csv->parse($input[1]))
        {
                @record = $csv->fields();
        }
        else
        {
                $plugin->error($csv->error_input);
                return undef;
        }

We take the first element and split it to get the field names. We then check that we have the same number of field names as values in the record.

        my @fields = split(',',$input[0]);

        if (scalar @fields != scalar @record)
        {
                $plugin->warning('Row length mismatch');
                return undef;
        }
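
The pairing of field names with values can be tried out on its own. The sketch below is not the plugin's code: the plugin walks the fields with an index, whereas this version uses a hash slice, and the helper name pair_up is hypothetical. It shows the same length check followed by the name-to-value mapping.

```perl
use strict;
use warnings;

# Field names from the header row and values from one data row,
# as they might look after parsing.
my @fields = ( 'title', 'abstract' );
my @record = ( 'This is a test title', 'This is a test abstract' );

# Hypothetical helper: pair names with values, or return undef when the
# row does not have exactly one value per field name.
sub pair_up
{
        my( $names, $values ) = @_;

        return undef if scalar @$names != scalar @$values;

        my %row;
        @row{ @$names } = @$values;    # hash slice assigns all pairs at once
        return \%row;
}

my $row = pair_up( \@fields, \@record );
print "$_ => $row->{$_}\n" for sort keys %$row;

# A short row is rejected rather than silently misaligned.
my $bad = pair_up( \@fields, [ 'Only a title' ] );
print "short row rejected\n" unless defined $bad;
```

Rejecting a row whose length differs from the header avoids assigning values to the wrong fields, which is why the plugin warns and skips such rows.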

This is the hash that we'll return later.

        my %output = ();

For convenience we get the DataSet object.

        my $dataset = $plugin->{session}->{repository}->get_dataset('archive');

We now iterate over the fields.

        my $i = 0;
        foreach my $field (@fields)
        {

If the field does not exist we look at the next one, remembering to increment our index.

                unless ($dataset->has_field($field))
                {
                        $i++;
                        next;
                }

We get the MetaField object corresponding to the current field.

                my $metafield = $dataset->get_field($field);

Fields that can hold multiple values are handled by splitting the value on semi-colons.

                if ($metafield->get_property('multiple'))
                {
                        my @values = split(';',$record[$i]);

Name fields are dealt with by using regular expressions and constructing a hash from the parts matched. The plugin expects names to be of the form Surname, Forenames, Lineage (Sr, Jr, III etc).

                        if ($metafield->{type} eq 'name')
                        {
                                my @names = ();

                                foreach my $value (@values)
                                {
                                        my $name = $value;

                                        next unless ($value =~ /^(.*?),(.*?)(,(.*?))?$/);
                                        push @names, {family => $1,given => $2,lineage => $4};
                                }

                                $output{$field} = \@names;
                        }
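
The name handling can be tested outside EPrints with a short script. The sketch below uses the same regular expression as the plugin on a made-up cell value. The trimming step is an addition that is not in the plugin: without it, the space after each comma would be kept at the start of the given name and lineage.

```perl
use strict;
use warnings;

# A cell holding two names separated by a semi-colon, as the plugin expects.
my $cell = 'Bloggs, Joe;Doe, John, Jr';

my @names;
foreach my $value ( split( /;/, $cell ) )
{
        # Same pattern as the plugin: family, given and an optional lineage.
        next unless $value =~ /^(.*?),(.*?)(,(.*?))?$/;
        my( $family, $given, $lineage ) = ( $1, $2, $4 );

        # Extra step, not in the plugin: trim the space that follows each comma.
        for my $part ( $family, $given, $lineage )
        {
                next unless defined $part;
                $part =~ s/^\s+|\s+$//g;
        }

        push @names, { family => $family, given => $given, lineage => $lineage };
}

foreach my $name ( @names )
{
        my $lineage = defined $name->{lineage} ? $name->{lineage} : '(none)';
        print "family=$name->{family} given=$name->{given} lineage=$lineage\n";
}
```

Note that a value without any comma does not match the pattern and is skipped silently, so a name typed as "Joe Bloggs" would be dropped from the list.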

Multiple fields which are not names are just added to the hash as an array reference.

                                $output{$field} = \@values;

Non-multiple fields are just added to the hash from the array of fields.

                        $output{$field} = $record[$i];

Finally we return a hash reference.

        return \%output;

Testing Your Plugin

After restarting your webserver go to the Import Items screen from the Manage Deposits screen. If you can't find this, make sure you're logged in.

Type this into the "Cut and Paste Records" box:

title,abstract
This is a test title,This is a test abstract
This is another test title,This is another test abstract

Select "CSV" from the Select import format drop down menu and click "Test Run + Import". You should end up at the Manage Deposits screen with the following message being displayed "Import completed: 2 item(s) imported.".

Embedding commas

If you want to include commas in your imports, which is very likely, you must enclose the field in double quotation marks. For example:

title,abstract
An interesting article,"Damn it Jim, I'm a Doctor, not a Perl hacker."

When doing this make sure not to leave any whitespace between a quotation mark and a comma or the import will fail.

Multiple fields

Fields that hold multiple values are handled by separating the individual values with semi-colons. A simple example of this is the subjects field.

Go back to the Import Items screen and proceed as before, but type this into the "Cut and Paste Records" box:

abstract,title,subjects
Testing,Testing,AI;C;M;F;P
Testing,Testing,AC;DC

After the records have been imported, examine each one on the View Item screen. You will find that a list of subjects is given, and that each subject is shown with a more descriptive name than the code we imported.

Compound fields

Compound fields are fields that have subfields within them, each with its own name. You don't set compound fields in one go, but set the components individually. Subfields have names of the form mainfieldname_subfieldname.

One of the most commonly used compound fields is the "Creators" field. It has a name subfield "creators_name" and an ID subfield "creators_id", which is most often used for email addresses.

Here is an example of setting the creators field:

title,creators_name,creators_id
Setting compound fields.,"Bloggs, Joe;Doe, John","joe@bloggs.com;john@doe.com"

If you import this record and then examine the View Item screen you will find that the "Creators" field has been set up with the values displayed in a table.
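
If you are generating input for this plugin from another program, the quoting and semi-colon rules described above can be wrapped in a small helper. The sketch below is stand-alone and the helper name csv_cell is hypothetical. It assumes the default Text::CSV convention of doubling any double quote that appears inside a quoted value.

```perl
use strict;
use warnings;

# Hypothetical helper: join multiple values with semi-colons, then wrap the
# cell in double quotes if it contains a comma or a double quote.
sub csv_cell
{
        my( @values ) = @_;

        my $cell = join( ';', @values );
        if( $cell =~ /[",]/ )
        {
                $cell =~ s/"/""/g;      # double any embedded quotes
                $cell = qq{"$cell"};
        }
        return $cell;
}

my @header = ( 'title', 'creators_name', 'creators_id' );
my @row = (
        csv_cell( 'Setting compound fields.' ),
        csv_cell( 'Bloggs, Joe', 'Doe, John' ),
        csv_cell( 'joe@bloggs.com', 'john@doe.com' ),
);

print join( ',', @header ), "\n";
print join( ',', @row ), "\n";
```

The output can be pasted into the "Cut and Paste Records" box. Values that contain a semi-colon themselves are not handled, because the plugin has no way to escape its own separator.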