GDPR

Latest revision as of 08:46, 14 June 2018

This page has been created to gather information and share code snippets to help EPrints repositories handle GDPR responsibilities. Some of the issues around GDPR will be specific to a repository's contents, but others are quite generic. N.B. We're not lawyers!

As a general rule of thumb, this is about people's rights and privacy. Imagine a worst-case scenario where the entire filesystem and database are made public. It's not a great situation, but consider what steps you could take to reduce the damage if that did happen: don't store any information about people that you don't actually need.

Areas of GDPR that may be relevant to EPrints repositories

  • Giving people a clear privacy policy about how you will use data you gather about them
  • Gaining consent for storing and processing data you gather about them
  • Defining and enforcing a retention policy to remove information about people that is no longer required
  • "Subject Access Requests" where someone wants to know everything you know about them
  • "Right to Erasure" where people may request you remove information about them, but there are exceptions to this.

Information about people

EPrints potentially stores information about people in a number of places.

  • Users Dataset
  • EPrints Metadata
    • In the revision XML files storing old versions of metadata.
  • In actual documents
  • In the "request a copy" Dataset
  • Web access logs
    • In the History Dataset
    • In the Access Dataset (IP addresses can sometimes be used to identify an individual)
    • In the Apache logs in the operating system (this will not be dealt with on this page, but should not be forgotten)

Users Dataset

This dataset is either populated from a web-based sign-up form or, in many cases, built automatically from an institutional accounts system. These two cases need to be addressed in different ways.

Imported users Dataset

First of all, stop importing any fields you don't actually need and remove those fields from EPrints. N.B. Removing a field from the config may not remove it from the database (citation required; how this works still needs confirming).

Next, stop importing any users you don't need. If they can't do anything while logged in, they should never exist in the Users dataset.

Next, implement a data retention policy. This will be a script that removes or blanks the data of users who match a criterion, e.g. users who have not appeared in the import script for more than 90 days and have never made a deposit.
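As a rough illustration of the retention rule described above, the selection step might look something like the sketch below. It deliberately avoids the EPrints API and works on plain hashes; the keys `last_seen_in_import` and `deposit_count` and the function `users_to_remove` are assumptions made for this example, not existing EPrints fields or calls.

```perl
use strict;
use warnings;

# Hypothetical retention rule for imported users: select accounts that have
# been absent from the import feed for more than 90 days and own no deposits.
my $RETENTION_SECONDS = 90 * 24 * 60 * 60;

# Returns the userids that match the retention rule.
# $users: array ref of hash refs with the keys userid, last_seen_in_import
#         (epoch seconds) and deposit_count. The last two names are assumptions.
# $now:   current time in epoch seconds, passed in so the rule is testable.
sub users_to_remove
{
    my( $users, $now ) = @_;

    my $cutoff = $now - $RETENTION_SECONDS;

    return map { $_->{userid} }
        grep { $_->{last_seen_in_import} < $cutoff && $_->{deposit_count} == 0 }
        @{$users};
}

my $now = time();
my @users = (
    { userid => 1, last_seen_in_import => $now - 200 * 86400, deposit_count => 0 },
    { userid => 2, last_seen_in_import => $now - 200 * 86400, deposit_count => 3 },
    { userid => 3, last_seen_in_import => $now - 5 * 86400,   deposit_count => 0 },
);

# Only user 1 is both stale and without deposits.
print join( ",", users_to_remove( \@users, $now ) ), "\n";
```

A real script would read these values from the Users dataset and would probably log the selected accounts for review before removing or blanking anything.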

Web signup users Dataset

You should ensure that the privacy policy is up to date and in line with GDPR best practice.

Stop collecting any information you don't actually need, and remove those fields from the database.

Next, implement a data retention policy. This will be a script that removes or blanks the data of users who match a criterion, e.g. accounts more than a year old that have never made a deposit.

Removing vs Blanking

In some cases it may be preferable to blank all the personal metadata in a record in the Users dataset rather than delete the record completely. This has pros and cons.

If you entirely remove a user who has deposited eprints, you may wish to reassign those deposits to an admin user, or to a fake user called "left".
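The two options could be sketched roughly as follows. Records are modelled as plain hashes rather than EPrints data objects, and the list of personal fields and both function names (`blank_user`, `reassign_deposits`) are assumptions for illustration only.

```perl
use strict;
use warnings;

# Personal fields to clear. These names are assumptions; replace them with
# the fields your repository actually stores about people.
my @PERSONAL_FIELDS = qw( name email dept org address );

# Option 1: blank the personal fields but keep the record and its userid,
# so that deposits remain linked to the (now anonymous) account.
sub blank_user
{
    my( $user ) = @_;

    delete @{$user}{@PERSONAL_FIELDS};

    return $user;
}

# Option 2: before removing a user entirely, hand their deposits over to a
# placeholder account such as the "left" user. Returns the number of
# eprint records that were changed.
sub reassign_deposits
{
    my( $eprints, $old_userid, $new_userid ) = @_;

    my $changed = 0;
    for my $eprint ( @{$eprints} )
    {
        next unless $eprint->{userid} == $old_userid;
        $eprint->{userid} = $new_userid;
        $changed++;
    }

    return $changed;
}

my $user = { userid => 42, name => 'A. Person', email => 'a.person@example.org' };
blank_user( $user );
print join( ",", sort keys %{$user} ), "\n";
```

Blanking keeps deposit ownership intact without further work, whereas full removal needs the reassignment step first. Bear in mind that old metadata revisions may still contain the person's details in either case.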

Last Login Time

Storing the last login time of a user can be useful to identify which users are active and which are not to help ensure data is not being stored longer than is necessary.

See GDPR/Last Login Time
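When building the value to store, it is easy to end up with timestamps that are not zero-padded (e.g. `2018-6-4 9:5:3`), which sort and compare badly. A minimal sketch of producing a padded string is given below; it assumes the field expects a `YYYY-MM-DD hh:mm:ss` string in UTC, and the function name `login_timestamp` is made up for this example.

```perl
use strict;
use warnings;
use POSIX qw( strftime );

# Format an epoch time as a zero-padded "YYYY-MM-DD hh:mm:ss" string in UTC.
sub login_timestamp
{
    my( $epoch ) = @_;

    return strftime( "%Y-%m-%d %H:%M:%S", gmtime( $epoch ) );
}

print login_timestamp( time() ), "\n";
```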

Non-Active Users Report

TODO: Is it worth having a count of the items linked with each user on the report, along with their statuses?

Delete User Action

Suggestion: a cron job that selects all user accounts which have been inactive for longer than a certain time threshold, checks those users for affiliation to a publication (creator, editor, depositor etc.) which is in either archive or buffer status and, where no such connection exists, removes those user accounts.

Admin will be able to select user accounts from the report to delete (within the report or individually?)

This could be similar to EPrints::DataObj::LoginTicket::expire_all, but with a $c->{'account_retention_time'} parameter subtracted from the time() call?

Do we need to check logintickets too? Someone may not have actually logged in for ages, but may have kept a browser session/cookies alive for a long time (currently our longest 'active' login is ~20 days).
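The selection logic discussed above might be sketched like this. It is not based on the EPrints API: users are plain hashes, and the keys `last_login`, `ticket_expiry` and `linked_items`, as well as both function names, are assumptions. The retention period stands in for the suggested $c->{'account_retention_time'} setting.

```perl
use strict;
use warnings;

# An account counts as inactive only if both its last login and the expiry
# of its newest login ticket fall before the cutoff, so a long-lived browser
# session still counts as activity.
sub is_inactive
{
    my( $user, $retention_seconds, $now ) = @_;

    my $cutoff = $now - $retention_seconds;

    # A missing value is treated as "never", i.e. epoch 0.
    my $last_login    = $user->{last_login}    // 0;
    my $ticket_expiry = $user->{ticket_expiry} // 0;

    return ( $last_login < $cutoff && $ticket_expiry < $cutoff ) ? 1 : 0;
}

# Candidates for removal: inactive accounts with no linked items in the
# archive or buffer (linked_items is an assumed, precomputed count).
sub removal_candidates
{
    my( $users, $retention_seconds, $now ) = @_;

    return grep {
        is_inactive( $_, $retention_seconds, $now ) && !$_->{linked_items}
    } @{$users};
}
```

Note that under this sketch a brand new account that has never logged in is immediately "inactive", so a real script would presumably also compare the account's creation date against the cutoff.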

Add 'Agree to privacy and data statement' checkbox on registration or request forms

TODO: the agreement should be versioned, so that if the statement changes, the version that was agreed to can be shown. Possibly a set/namedset, with the available options limited to only one when rendered to a user?
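One possible shape for versioned consent is sketched below, purely as an illustration. The version labels, the keys `privacy_version_agreed` and `privacy_agreed_at`, and the function names are all hypothetical, and the user record is a plain hash rather than an EPrints data object.

```perl
use strict;
use warnings;

# Hypothetical list of statement versions, oldest first. Only the newest
# one is offered on the form, mirroring the "limited to one option" idea.
my @STATEMENT_VERSIONS = ( "v1", "v2" );

sub current_version
{
    return $STATEMENT_VERSIONS[-1];
}

# Record which version the user agreed to, and when (epoch seconds).
sub record_consent
{
    my( $user, $now ) = @_;

    $user->{privacy_version_agreed} = current_version();
    $user->{privacy_agreed_at}      = $now;

    return $user;
}

# True if the user has never agreed, or agreed to an older version.
sub needs_reconsent
{
    my( $user ) = @_;

    my $agreed = $user->{privacy_version_agreed};

    return ( defined $agreed && $agreed eq current_version() ) ? 0 : 1;
}
```

Storing the version label rather than a simple yes/no flag means the exact statement a person agreed to can still be looked up after the statement changes.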

Chris Gutteridge has written a blog post on GDPR with a few comments about the EPrints Request a Copy feature: http://blog.soton.ac.uk/webteam/2018/05/10/gdpr-preperations/#post-content-1793;char=5429-7155

EPrints Metadata

TODO

In actual documents

TODO

Request a copy Dataset

This collects an email address and a reason for requesting the document. Without intervention this can be stored indefinitely.

You should add a privacy statement to the form saying what you'll do with the information, add a reminder to the email sent to authors that they shouldn't misuse the contact information, define a retention policy, and automate the enforcement of that policy.
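As one possible way of enforcing such a retention policy, the sketch below strips the personal data from old requests while keeping the rest of each record, so that counts of requests per item survive. This is a design choice made for the example rather than anything EPrints does itself; requests are plain hashes, and the keys `received`, `email` and `reason` and the function name are assumptions.

```perl
use strict;
use warnings;

# Remove the requester's email and stated reason from requests older than
# $max_age_days. Returns the number of requests that were anonymised.
# $requests: array ref of hash refs with a "received" time in epoch seconds.
sub anonymise_old_requests
{
    my( $requests, $max_age_days, $now ) = @_;

    my $cutoff = $now - $max_age_days * 24 * 60 * 60;

    my $changed = 0;
    for my $request ( @{$requests} )
    {
        next unless $request->{received} < $cutoff;
        next unless exists $request->{email} || exists $request->{reason};

        delete @{$request}{qw( email reason )};
        $changed++;
    }

    return $changed;
}
```

Run regularly (for example from cron), this keeps the personal data for no longer than the chosen period. Deleting the old request records outright would be the simpler alternative if the counts are not needed.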

Adding a Privacy Statement

TODO

Altering the email to authors

TODO

Automating the removal of old requests

TODO

History dataset (used for access stats)

TODO