From EPrints Documentation
Revision as of 10:27, 1 June 2018 by Cjg

This page has been created to gather information and share code snippets to help EPrints repositories handle their GDPR responsibilities. Some GDPR issues will be specific to a repository's contents, but others are quite generic. NB: we're not lawyers!

As a general rule of thumb, this is about people's rights and privacy. Imagine a worst-case scenario where the entire filesystem and database are made public. It's not a great situation, but consider what steps you could take now to reduce the damage if it did happen: don't store any information about people that you don't actually need.

Areas of GDPR that may be relevant to EPrints repositories

  • Giving people a clear privacy policy about how you will use data you gather about them
  • Gaining consent for storing and processing data you gather about them
  • Defining and enforcing a retention policy to remove information about people that is no longer required
  • "Subject Access Requests" where someone wants to know everything you know about them
  • "Right to Erasure" where people may request you remove information about them, but there are exceptions to this.

Information about people

EPrints potentially stores information about people in a number of places.

  • Users Dataset
  • EPrints Metadata
    • In the revision XML files storing old versions of metadata.
  • In actual documents
  • In the "request a copy" Dataset
  • Web access logs
    • In the History Dataset
    • In the Apache logs in the operating system

Users Dataset

This dataset is either populated from a web-based sign-up form or, in many cases, built automatically from an institutional accounts system. These cases need to be addressed in different ways.

Imported users Dataset

First of all, stop importing any fields you don't actually need and remove those fields from EPrints. NB: removing a field from the config may not remove it from the database (citation required; how this works still needs confirming).

Next, stop importing any users you don't need: if they can't do anything while logged in, they should never exist in the Users dataset.

Next, implement a data retention policy. This will be a script that removes or blanks the data of users who match given criteria, e.g. users who have not appeared in the import script for more than 90 days and have never made a deposit.
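The selection step of such a retention script can be sketched as follows. This is a minimal illustration in Python of the 90-day example above, not code against the EPrints API: the `ImportedUser` record and its field names are assumptions made up for the example, and a real script would read these values from the Users dataset and then remove or blank the matching records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ImportedUser:
    # Hypothetical record shape; these are not real EPrints field names.
    username: str
    last_seen_in_import: datetime  # when the import script last saw this account
    deposit_count: int             # number of items the user has deposited


def users_past_retention(
    users: List[ImportedUser],
    now: datetime,
    max_absence: timedelta = timedelta(days=90),
) -> List[ImportedUser]:
    """Return users missing from the import for longer than max_absence
    who have never made a deposit."""
    cutoff = now - max_absence
    return [
        u for u in users
        if u.last_seen_in_import < cutoff and u.deposit_count == 0
    ]


if __name__ == "__main__":
    now = datetime(2018, 6, 1)
    users = [
        ImportedUser("still_imported", datetime(2018, 5, 30), 0),
        ImportedUser("gone_no_deposits", datetime(2018, 1, 1), 0),
        ImportedUser("gone_with_deposits", datetime(2018, 1, 1), 3),
    ]
    for user in users_past_retention(users, now):
        print(user.username)
```

Passing `now` in explicitly, rather than reading the clock inside the function, makes it easy to do a dry run and check which accounts would be affected before anything is deleted.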

Web signup users Dataset

You should ensure that the privacy policy is up to date and in line with GDPR best practice.

Stop collecting any information you don't actually need, and remove those fields from the database.

Next, implement a data retention policy. This will be a script that removes or blanks the data of users who match given criteria, e.g. accounts more than a year old that have never made a deposit.

Removing vs Blanking

In some cases it may be more desirable to remove all the personal metadata from a record in the Users dataset rather than delete it completely. This has pros and cons.

If you entirely remove a user who has deposited eprints, you may wish to reassign those deposits to an admin user, or to a placeholder user called "left".
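The difference between the two approaches can be sketched like this. It is an illustration in Python using plain dictionaries, not EPrints code: the field names in `PERSONAL_FIELDS`, the record layout and the placeholder user id are all assumptions chosen for the example.

```python
from typing import Dict, List

# Hypothetical list of fields treated as personal data; adjust to your own schema.
PERSONAL_FIELDS = ("name", "email", "dept")


def blank_user(user: dict) -> dict:
    """Return a copy of the record with personal fields cleared.
    The userid is kept, so existing deposits still point at a valid record."""
    return {k: (None if k in PERSONAL_FIELDS else v) for k, v in user.items()}


def remove_user(
    users: Dict[int, dict],
    eprints: List[dict],
    userid: int,
    placeholder_id: int,
) -> None:
    """Delete a user after reassigning their deposits to a placeholder user
    (for example an admin account, or a fake user called "left")."""
    if placeholder_id not in users:
        raise KeyError("placeholder user must exist before reassigning deposits")
    if userid == placeholder_id:
        raise ValueError("cannot remove the placeholder user itself")
    for eprint in eprints:
        if eprint["userid"] == userid:
            eprint["userid"] = placeholder_id
    del users[userid]
```

Blanking keeps the link between a deposit and its (now anonymous) depositor record, while removal needs the reassignment step so that no deposit is left pointing at a user that no longer exists.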

Last Login Time

Storing the last login time of a user can be useful for identifying which users are active and which are not, which helps ensure data is not stored for longer than necessary.

See GDPR/Last Login Time

Non-Active Users Report

TODO: is it worth having a count of the items linked with each user on the report, along with their statuses?

Delete User Action

  • Suggestion: a cron job selects all user accounts which have been inactive for longer than a certain time threshold, checks those users for affiliation to a publication (creator, editor, depositor etc.) which is in either archive or buffer status and, where no connection exists, removes those user accounts.

Admin will be able to select user accounts from the report to delete (within the report or individually?)

This could be similar to EPrints::DataObj::LoginTicket::expire_all, but with a $c->{'account_retention_time'} parameter subtracted from the time() call?

Do logintickets need checking too? Someone may not have actually logged in for ages, but may have maintained a browser session and cookies for a long time (currently our longest 'active' login is ~20 days).
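The selection logic suggested above can be sketched as follows. This is a Python illustration over plain dictionaries, not the EPrints Perl API: `ACCOUNT_RETENTION_TIME` stands in for the proposed `$c->{'account_retention_time'}` setting, and the record fields (`last_login`, `linked_userids`, the ticket `time`) are assumed names for the example.

```python
from typing import Dict, List, Optional

# Stand-in for the proposed $c->{'account_retention_time'} setting, in seconds.
ACCOUNT_RETENTION_TIME = 365 * 24 * 60 * 60

# Items in these statuses protect any user linked to them from deletion.
PROTECTED_STATUSES = ("archive", "buffer")


def deletable_users(
    users: List[dict],
    eprints: List[dict],
    tickets: List[dict],
    now: int,
    retention: int = ACCOUNT_RETENTION_TIME,
) -> List[int]:
    """Return the ids of users inactive since before the cutoff and not linked
    (as creator, editor, depositor etc.) to any archive or buffer item."""
    cutoff = now - retention

    # Users linked to at least one protected item.
    linked = set()
    for eprint in eprints:
        if eprint["status"] in PROTECTED_STATUSES:
            linked.update(eprint["linked_userids"])

    # Most recent login ticket per user, to catch long-lived browser sessions.
    latest_ticket: Dict[int, int] = {}
    for ticket in tickets:
        uid = ticket["userid"]
        latest_ticket[uid] = max(latest_ticket.get(uid, 0), ticket["time"])

    result = []
    for user in users:
        uid = user["userid"]
        last_login: Optional[int] = user.get("last_login")
        # A user who has never logged in counts as inactive since time 0.
        last_seen = max(last_login or 0, latest_ticket.get(uid, 0))
        if last_seen < cutoff and uid not in linked:
            result.append(uid)
    return result
```

Note that, as written, an account that has never logged in is treated as inactive however recently it was created; a real job would probably also compare against the account's creation time.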

Add 'Agree to privacy and data statement' checkbox on registration or request forms

TODO: this should be 'versioned', so that if the statement changes, the version that was agreed to can still be shown. Possibly a set/namedset, with the available options limited to only one when rendered to a user?
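One way the versioning idea could work is sketched below in Python. It is only an illustration of the data model, not an EPrints field definition: the statement texts, the version labels and the `consent_version` / `consent_time` field names are all hypothetical.

```python
from typing import List

# Every statement ever shown is kept, keyed by version (hypothetical texts).
STATEMENTS = {
    "v1": "We store your name and email to manage your account.",
    "v2": "We store your name and email to manage your account and deposits.",
}
CURRENT_VERSION = "v2"


def offered_versions() -> List[str]:
    """Only the current version is offered when the form is rendered."""
    return [CURRENT_VERSION]


def record_consent(user: dict, version: str, agreed_at: int) -> None:
    """Store which version the user agreed to, and when."""
    if version not in offered_versions():
        raise ValueError("only the current statement version can be agreed to")
    user["consent_version"] = version
    user["consent_time"] = agreed_at


def agreed_statement(user: dict) -> str:
    """Return the exact text the user agreed to, even if it has since changed."""
    return STATEMENTS[user["consent_version"]]


def needs_reconsent(user: dict) -> bool:
    """True if the user has not agreed to the current statement."""
    return user.get("consent_version") != CURRENT_VERSION
```

Keeping old statement texts alongside the stored version label means the repository can show what was agreed to at the time, and can also find users who should be asked again after the statement changes.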

Chris Gutteridge has written a blog post on GDPR with a few comments about the EPrints Request a Copy feature:;char=5429-7155