GDPR
This page gathers information and shares code snippets to help EPrints repositories handle their GDPR responsibilities. Some GDPR issues will be specific to a repository's contents, but others are quite generic. nb. We're not lawyers!
As a general rule of thumb, this is about people's rights and privacy. Imagine a worst-case scenario where the entire filesystem and database are made public. It's not a great situation, but consider what steps you could take now to reduce the damage if that did happen: don't store any information about people that you don't actually need.
Areas of GDPR that may be relevant to EPrints repositories
- Giving people a clear privacy policy about how you will use data you gather about them
- Gaining consent for storing and processing data you gather about them
- Defining and enforcing a retention policy to remove information about people that is no longer required
- "Subject Access Requests" where someone wants to know everything you know about them
- "Right to Erasure", where people may request that you remove information about them (there are exceptions to this)
Information about people
EPrints potentially stores information about people in a number of places.
- Users Dataset
- EPrints Metadata
- In the revision XML files storing old versions of metadata.
- In actual documents
- In the "request a copy" Dataset
- Web access logs
- In the History Dataset
- In the Access Dataset (IP addresses can sometimes be used to identify an individual)
- In the Apache logs in the operating system (this will not be dealt with on this page, but should not be forgotten)
Users Dataset
This dataset is either populated with a web based sign up form, or in many cases, it is automatically built from an institutional accounts system. These cases need to be addressed in different ways.
Imported users Dataset
First of all, stop importing any fields you don't actually need and remove those fields from EPrints. nb. Removing a field from the config may not remove it from the database (citation needed; how this works still has to be confirmed).
Next, stop importing any users you just don't need: if they can't do anything while logged in, they should never exist in the Users dataset.
Next, implement a data retention policy. This will be a script that removes or blanks the data of users who match given criteria, e.g. users who have not appeared in the import for more than 90 days and have never made a deposit.
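A minimal sketch of the selection logic for such a retention script, written in Python to show the logic only rather than against the EPrints Perl API. It assumes each imported user record carries a `last_seen_in_import` timestamp and a `deposit_count`; both names are hypothetical and are not standard EPrints fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ImportedUser:
    # Hypothetical record shape; these are illustrative names, not EPrints fields.
    userid: int
    last_seen_in_import: datetime
    deposit_count: int


def users_past_retention(users, now, max_absence_days=90):
    """Return users absent from the import for more than
    max_absence_days who have never made a deposit."""
    cutoff = now - timedelta(days=max_absence_days)
    return [
        u for u in users
        if u.last_seen_in_import < cutoff and u.deposit_count == 0
    ]
```

The returned users are only candidates; whether they are then removed or blanked is a separate decision.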
Web signup users Dataset
You should ensure that the privacy policy is up to date and in line with GDPR best practice.
Stop collecting any information you don't actually need, and remove those fields from the database.
Next, implement a data retention policy. This will be a script that removes or blanks the data of users who match given criteria, e.g. accounts more than a year old that have never made a deposit.
Removing vs Blanking
In some cases it may be more desirable to remove all the personal metadata from a record in the Users dataset rather than delete it completely. This has pros and cons.
If you entirely remove a user who has deposited eprints, you may wish to reassign those deposits to an admin user, or to a fake user called "left".
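The two options can be contrasted with a small Python sketch. It is illustrative only: records are plain dictionaries, the list of personal fields is an assumption to adapt to your own repository, and `placeholder_id` stands for the admin or "left" account mentioned above.

```python
# Hypothetical set of personal fields; adjust to whatever your repository stores.
PERSONAL_FIELDS = ("name", "email", "dept", "address")


def blank_user(user):
    """Blanking: return a copy of the record without personal fields.
    The userid survives, so deposits still point at a valid user."""
    return {k: v for k, v in user.items() if k not in PERSONAL_FIELDS}


def remove_user(users, eprints, userid, placeholder_id):
    """Removing: reassign the user's deposits to a placeholder account,
    then return the user list without that user."""
    for eprint in eprints:
        if eprint["userid"] == userid:
            eprint["userid"] = placeholder_id
    return [u for u in users if u["userid"] != userid]
```

Blanking keeps the link between a deposit and its original account, while removing loses it; which is preferable depends on whether that link is itself information you need to keep.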
Last Login Time
Storing the last login time of a user can be useful for identifying which users are active and which are not, which helps ensure data is not stored longer than necessary.
Non-Active Users Report
TODO worth having a count of items linked with the user on the report, along with their statuses?
Delete User Action
- Suggestion: a cron job to select all user accounts which have been inactive for longer than a certain time threshold, check those users for affiliation to a publication (creator, editor, depositor etc.) which is in either archive or buffer status and, where no connection exists, remove those user accounts.
Admin will be able to select user accounts from the report to delete (within the report or individually?)
Could be similar to EPrints::DataObj::LoginTicket::expire_all, but with an $c->{'account_retention_time'} param subtracted from the time() call?
Need to check logintickets too? Someone may not have actually logged in for ages, but maintained a browser session/cookies for a long time? (currently our longest 'active' login is ~20 days).
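The selection step of the suggested cron job could be sketched as follows, in Python and independent of the EPrints API. The record shapes are assumptions: users are dictionaries with an optional `last_login`, eprints carry a `status` and a hypothetical `linked_userids` list (creators, editors, depositor), and `ticket_expiry_by_user` maps a userid to the expiry time of its latest login ticket so that long-lived sessions count as activity.

```python
from datetime import datetime, timedelta

# Eprint statuses that protect a linked user from deletion, as suggested above.
PROTECTED_STATUSES = {"archive", "buffer"}


def last_activity(user, ticket_expiry_by_user):
    """Latest of the last login and any login ticket expiry, or None."""
    times = [
        t for t in (user.get("last_login"),
                    ticket_expiry_by_user.get(user["userid"]))
        if t is not None
    ]
    return max(times) if times else None


def deletable_users(users, eprints, ticket_expiry_by_user, now, retention_days):
    """Return userids that are inactive and not linked to a protected eprint."""
    cutoff = now - timedelta(days=retention_days)
    linked = set()
    for eprint in eprints:
        if eprint["status"] in PROTECTED_STATUSES:
            linked.update(eprint["linked_userids"])
    result = []
    for user in users:
        seen = last_activity(user, ticket_expiry_by_user)
        # Assumption: no recorded activity at all counts as inactive.
        # A real script would probably also check the account creation date.
        inactive = seen is None or seen < cutoff
        if inactive and user["userid"] not in linked:
            result.append(user["userid"])
    return result
```

The same list could feed the non-active users report, leaving the actual deletion to an admin.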
Add 'Agree to privacy and data statement' checkbox on registration or request forms
TODO - should be 'versioned' - so if the statement changes, the version that was agreed to can be shown. Possibly a set/namedset, with the available options limited to only one when rendered to a user?
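One way to think about the versioning, sketched in Python rather than as an EPrints field definition: store the version of the statement alongside the agreement, so the agreed version can be shown later and stale consent can be detected. The field names and the version string are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical identifier for the statement currently shown on the form.
CURRENT_STATEMENT_VERSION = "2018-05-25"


def record_consent(record, agreed, version=CURRENT_STATEMENT_VERSION, when=None):
    """Store which statement version was agreed to, and when.
    Raises ValueError if the checkbox was not ticked."""
    if not agreed:
        raise ValueError("privacy and data statement must be accepted")
    record["privacy_statement_version"] = version
    record["privacy_statement_agreed_at"] = when or datetime.now(timezone.utc)
    return record


def needs_reconsent(record, current_version=CURRENT_STATEMENT_VERSION):
    """True when there is no consent, or consent to a different version."""
    return record.get("privacy_statement_version") != current_version
```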
Chris Gutteridge has written a blog post on GDPR with a few comments about the EPrints Request a Copy feature: http://blog.soton.ac.uk/webteam/2018/05/10/gdpr-preperations/#post-content-1793;char=5429-7155
EPrints Metadata
TODO
In actual documents
TODO
Request a copy Dataset
This collects an email address and a reason for requesting the document. Without intervention this can be stored indefinitely.
You should:
- add a privacy statement to the form to say what you'll do with the information
- add a reminder to the email sent to the authors that they shouldn't misuse the contact info
- define a retention policy
- automate the enforcement of the retention policy
Adding a Privacy Statement
TODO
Altering the email to authors
TODO
Automating the removal of old requests
TODO
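Until this section is written, here is a minimal Python sketch of the age check such a job would need. It assumes each request record has a `datestamp`, and the 180-day default is a placeholder for your own retention policy, not a recommendation.

```python
from datetime import datetime, timedelta


def split_requests_by_age(requests, now, retention_days=180):
    """Return (kept, expired): requests within the retention period,
    and requests older than it, which should be deleted."""
    cutoff = now - timedelta(days=retention_days)
    kept = [r for r in requests if r["datestamp"] >= cutoff]
    expired = [r for r in requests if r["datestamp"] < cutoff]
    return kept, expired
```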
History dataset (used for access stats)
TODO