EPrints4 Core Roadmap

Latest revision as of 23:53, 11 September 2018


EPrints4 Core Development

Trying to divide the work into smaller tasks (and questions where relevant).

Request Handler / Page Controllers

This deals with how HTTP requests are handled by EPrints: when a request is made by a client (curl, a browser, ...) a number of headers/parameters must be processed (e.g. Accept, Content-Type, EPrints' cookies, auth). This is the entry point of any web request to the system. The main module is EPrints::Apache::Handler (formerly known as Apache::Rewrite).

  • Basic security aspects
    • detect invalid requests/URL paths (e.g. containing a dot '.')
  • Low-level filters for processing of:
    • Cookies (eprints_session, eprints_lang)
    • URL encoding
    • specific HTTP headers: Accept, Content-Type, Content-Length, Content-Range, Content-MD5 (?), ETag (?)
  • System initialisation:
    • Repository object init (with language selection, via cookie/default lang)
    • Protocol (HTTP, HTTPS?)
  • Low-level redirects or custom actions via Trigger
  • Call for Controller(s) to handle the request (e.g. CRUD if /id/..., UI, storage (file delivery) etc.) via Trigger
  • clean-up/post-processing if needed (e.g. setting Cache headers).
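
The flow listed above can be sketched roughly as follows. This is only an illustration in Python, not the EPrints::Apache::Handler code: the `TRIGGERS` registry, the `register`/`run`/`handle` functions, the request dictionary layout and the `"en"` default language are all assumptions made for the example.

```python
from collections import defaultdict
from http.cookies import SimpleCookie

# Hypothetical trigger registry: trigger name -> ordered list of callbacks.
TRIGGERS = defaultdict(list)

def register(trigger, callback):
    TRIGGERS[trigger].append(callback)

def run(trigger, request):
    """Run callbacks in order; stop at the first one that returns a response."""
    for callback in TRIGGERS[trigger]:
        response = callback(request)
        if response is not None:
            return response
    return None

def reject_dotted_paths(request):
    # Basic security check: refuse URL paths containing a dot.
    if "." in request["path"]:
        return (400, "invalid path")
    return None

def read_cookies(request):
    # Low-level filter: pick the language from the eprints_lang cookie.
    cookies = SimpleCookie(request["headers"].get("Cookie", ""))
    lang = cookies.get("eprints_lang")
    request["lang"] = lang.value if lang else "en"  # "en" is an assumed default
    return None

def crud_controller(request):
    # Controller: only claims requests under /id/.
    if request["path"].startswith("/id/"):
        return (200, "CRUD " + request["path"])
    return None

register("filter", reject_dotted_paths)
register("filter", read_cookies)
register("controller", crud_controller)

def handle(request):
    # Filters first, then controllers, then a fallback response.
    return run("filter", request) or run("controller", request) or (404, "not found")
```

Because filters and controllers are only looked up through the registry, adding or removing one is a matter of calling `register` (or editing the list) rather than changing `handle`.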

Security

Security (authentication and authorisation) must be handled by the Page Controllers. They decide if auth is required at all (for instance a GET on an object via CRUD is usually allowed without auth). If auth may be required, the PerlAccessHandler stack must be defined.

It is worth noting that authentication is likely to be common to most Page Controllers. The typical authentication mechanisms are: Basic Auth (via HTTP Header) and Cookie/Session-based Auth. Session-based is mostly handled internally (via LoginTicket) but the way the credentials are retrieved can be customised (CAS, LDAP...).

Authorisation, on the other hand, will be defined by the Page Controller. If the request is trying to modify an object, the Controller needs to check that the logged-in user has the appropriate rights/roles.

Filters must be called using Trigger(s) or a similar mechanism to avoid having hard-coded modules in the Request handler. It is then easy to select, add or remove filters at run-time.
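
A minimal sketch of the split between authentication (shared) and authorisation (decided by the controller) might look like this. It is written in Python for illustration only; the `USERS` table, the `editor` role and both function names are invented for the example, and plain-text passwords are used purely to keep it short.

```python
import base64

# Made-up user table for the example; real credentials could come from LDAP, CAS, etc.
USERS = {
    "alice": {"password": "secret", "roles": {"editor"}},
    "bob": {"password": "hunter2", "roles": set()},
}

def authenticate_basic(headers):
    """Return the user name from an HTTP Basic Authorization header, or None."""
    scheme, _, encoded = headers.get("Authorization", "").partition(" ")
    if scheme != "Basic":
        return None
    try:
        decoded = base64.b64decode(encoded).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None
    name, _, password = decoded.partition(":")
    user = USERS.get(name)
    if user is not None and user["password"] == password:
        return name
    return None

def crud_access(method, headers):
    """Controller-level decision, returned as an HTTP status code."""
    if method in ("GET", "HEAD"):
        return 200  # reads are usually allowed without auth
    name = authenticate_basic(headers)
    if name is None:
        return 401  # authentication required
    if "editor" not in USERS[name]["roles"]:
        return 403  # authenticated but not authorised to modify
    return 200
```

Only `crud_access` knows which methods need which role; `authenticate_basic` could be reused unchanged by any other controller.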

Database

  • Must contain all SQL handling. Custom SQL in scripts is bad and points to gaps in the DB API. This must handle the usual operations required by the system, such as data retrieval and modification, and any DB optimisation (commonly known as hacks).
  • The default DB engine is likely to be MySQL/InnoDB with support for transactions.
  • The DB layer is "very" low-level and doesn't check security concerns (e.g. can the user modify that object?). This must be handled by the layer above, usually DataObj and MetaField.
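
As a rough illustration of these three points, the sketch below keeps every SQL statement inside one class, wraps writes in transactions, and performs no permission checks. It uses Python's built-in sqlite3 as a stand-in for MySQL/InnoDB; the `Database` class, its methods and the `eprint` table layout are assumptions for the example, not the EPrints DB API.

```python
import sqlite3

class Database:
    """Hypothetical low-level DB wrapper: the only place that builds SQL."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE eprint ("
            "eprintid INTEGER PRIMARY KEY, title TEXT, userid INTEGER)"
        )

    def insert(self, title, userid):
        # "with self.conn" commits on success and rolls back on an exception.
        with self.conn:
            cursor = self.conn.execute(
                "INSERT INTO eprint (title, userid) VALUES (?, ?)",
                (title, userid),
            )
        return cursor.lastrowid

    def update_title(self, eprintid, title):
        # No security check here: that belongs to the layer above.
        with self.conn:
            self.conn.execute(
                "UPDATE eprint SET title = ? WHERE eprintid = ?",
                (title, eprintid),
            )

    def get(self, eprintid):
        return self.conn.execute(
            "SELECT eprintid, title, userid FROM eprint WHERE eprintid = ?",
            (eprintid,),
        ).fetchone()
```

Scripts would then call `insert`, `update_title` or `get` and never write SQL themselves.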

TODO: an audit of the current DB code to remove un-needed/deprecated methods and to re-organise the required existing methods.


Data Representation

This defines properties for data objects (for example the "name" of a user) and offers a layer between the raw data (handled by Database) and the application/views using the data (e.g. CRUD, "the" UI, Export/Import plug-ins etc).

Important: data objects and metadata fields do not know how to render themselves on a UI. This is one of the main design aspects of EPrints4, so as to remove dependencies of the UI on the data model.

Metadata fields

  • make sure the data is valid, according to the type's constraints. For instance, you cannot store a string in an integer field
  • map any higher-level type (e.g. URL) to a low-level DB type (e.g. varchar)
  • must provide a clean and clear API to easily modify data (set_value, add_value etc)
  • the type "file" whilst not being a piece of metadata directly is nonetheless a valid field - it is the only type that links data to a stored file (rather than being purely DB data).
  • XML/SAX handlers are likely to be kept at the Meta-Field level (because this makes sense)
  • aspects such as indexing, generating order values are more likely to be included in the Search components (so via Xapian for instance)
  • review of the existing types is needed (40ish types at the moment) and non-core types must be removed (recaptcha, compound...)
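
The first three points (validation, mapping to a DB type, a small modification API) could be sketched as below. This is an illustrative Python sketch, not EPrints code: the class names, the `sql_type` attribute and the `pages`/`official_url` field names are assumptions, and only `set_value` borrows its name from the list above.

```python
from urllib.parse import urlparse

class MetaField:
    """Base field: validates a value and names the low-level DB type."""
    sql_type = "VARCHAR(255)"

    def __init__(self, name):
        self.name = name

    def validate(self, value):
        raise NotImplementedError

class IntField(MetaField):
    sql_type = "INTEGER"

    def validate(self, value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{self.name}: expected an integer, got {value!r}")
        return value

class UrlField(MetaField):
    sql_type = "VARCHAR(255)"  # higher-level type stored as a plain varchar

    def validate(self, value):
        parts = urlparse(value) if isinstance(value, str) else None
        if parts is None or not parts.scheme or not parts.netloc:
            raise ValueError(f"{self.name}: expected a URL, got {value!r}")
        return value

class DataObj:
    """Holds values and delegates validation to the fields."""

    def __init__(self, fields):
        self.fields = {field.name: field for field in fields}
        self.data = {}

    def set_value(self, name, value):
        self.data[name] = self.fields[name].validate(value)
```

Note that nothing here knows how to render a value: the fields only validate and describe storage.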

Data set / Data objects

  • DataSet/Objects are collections of metadata fields and offer an API for easily manipulating objects/data.
  • Dataset offers a number of properties to handle common fields/concepts in a consistent manner across different datasets, for instance the "lastmod" property, an internal field always set to the time the object was last modified, or "history", which keeps track of the modifications of the object. Considered properties:
    • "states": the graph of states for that object. Similar to "eprint_status" for "eprint" objects in EPrints2/3. Automates and generalises state transfer for objects.
    • "revision": keeps a revision number which is incremented each time that object is modified (default: ON)
    • "acl": enables the ACL related fields and permission checking (TODO)
    • "cache": enables the use of memcached (default: ON if globally enabled) - if enabled, keep a copy of data objects in memory rather than reading objects from DB.
    • "read-only": makes that data-set read only and forbids any modification/deletion (default: OFF)
    • "history": keeps an XML diff of the modifications made to that object. (TODO)
    • "lastmod": keeps a last modified field
    • "datestamp": the date at which the object was created
  • EPrints::DataObj now contains all required information so that custom datasets do not need to have their explicit implementation (EPrints::DataObj::Foo package)
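
To make the idea of dataset-level properties concrete, here is a small Python sketch in which a single generic object class behaves differently depending on the properties of its dataset, so no per-dataset subclass is needed. Only "revision", "read-only", "lastmod" and "datestamp" are modelled; the class layout and constructor arguments are assumptions for the example.

```python
from datetime import datetime, timezone

class DataSet:
    """Dataset configuration: which common properties are switched on."""

    def __init__(self, name, revision=True, read_only=False):
        self.name = name
        self.revision = revision    # default: ON
        self.read_only = read_only  # default: OFF

class DataObj:
    """Generic data object; behaviour is driven by the dataset's properties."""

    def __init__(self, dataset, data):
        now = datetime.now(timezone.utc)
        self.dataset = dataset
        self.data = dict(data)
        self.data["datestamp"] = now  # creation date
        self.data["lastmod"] = now
        if dataset.revision:
            self.data["revision"] = 1

    def set_value(self, name, value):
        if self.dataset.read_only:
            raise PermissionError(f"dataset {self.dataset.name!r} is read-only")
        self.data[name] = value
        self.data["lastmod"] = datetime.now(timezone.utc)
        if self.dataset.revision:
            self.data["revision"] += 1
```

A custom dataset is then just another `DataSet(...)` instance with its own switches.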

Question: EPrints2/3 does not offer a generalised manner to reference a sub-dataset: "archive" is used as a short-cut for "eprint:archive" but because the dataset naming is global you can't re-use "archive" for other datasets. So we need a common way to reference a dataset. eprint:inbox? eprint/inbox? eprint_inbox?

Data retrieval and searching

I think there are two slightly different aspects of data retrieval:

  • an internal mechanism for retrieving data straight from the database, for instance:
    • all data objects belonging to a dataset
    • all data objects owned by a user
    • generally the retrieval of data objects based on an EXACT constraint or set of constraints.
    • possibility to order the data objects by a given meta field (i.e. a DB column).
  • a user search optimised for helping users find content they're interested in. Because this can become difficult to implement, it is better to use a specialist tool, most likely Xapian. Includes the following scenarios:
    • retrieval of all data objects belonging to a dataset (via the Xapian indexes, which may be out-of-date or incomplete since objects are indexed by the indexer in an asynchronous manner)
    • parsing complex user queries using e.g. boolean operators

One question is how to label these two slightly different searching mechanisms.

Internal Data retrieval

  • Only supports EXACT matching or other SQL-derived tests and operators e.g. NOT NULL (aka is_set)
  • Tightly linked to the concept of EPrints::List
  • May be used on the UI to display lists of items owned by a user etc. but shouldn't be used in real user searches
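
A sketch of what such an internal retrieval call could look like, using Python's sqlite3 as a stand-in database: it accepts only exact-match constraints, is_set (NOT NULL) tests and an optional ordering column, and returns a plain list of ids in the spirit of EPrints::List. The function name, its parameters and the column whitelist are assumptions for the example.

```python
import sqlite3

# Assumed schema for the example; column names are whitelisted because
# they cannot be passed as bound parameters.
COLUMNS = {"eprintid", "userid", "eprint_status", "title"}

def _check(column):
    if column not in COLUMNS:
        raise ValueError(f"unknown column: {column!r}")
    return column

def find_ids(conn, exact=None, is_set=(), order_by=None):
    """Return eprint ids matching exact constraints and NOT NULL tests."""
    clauses, params = [], []
    for column, value in (exact or {}).items():
        clauses.append(f"{_check(column)} = ?")
        params.append(value)
    for column in is_set:
        clauses.append(f"{_check(column)} IS NOT NULL")
    sql = "SELECT eprintid FROM eprint"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if order_by is not None:
        sql += f" ORDER BY {_check(order_by)}"
    return [row[0] for row in conn.execute(sql, params)]
```

For example, `find_ids(conn, exact={"userid": 7}, order_by="title")` would list the items owned by user 7, ordered by the title column.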

User Searches

  • Requires the creation and management of indexes (because it is the indexes which are searched)
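
The consequence of searching indexes rather than the database can be shown with a toy inverted index. This is deliberately not Xapian, just a few lines of Python with invented names (`ToyIndex`, `queue`, `run_indexer`, `search_all`): objects only become findable once the asynchronous indexer has processed them.

```python
from collections import defaultdict, deque

class ToyIndex:
    """Toy inverted index with a pending queue standing in for the indexer."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of object ids
        self.pending = deque()            # objects waiting to be indexed

    def queue(self, objectid, text):
        # Called when an object is saved; indexing happens later.
        self.pending.append((objectid, text))

    def run_indexer(self):
        # Stand-in for the asynchronous indexer process.
        while self.pending:
            objectid, text = self.pending.popleft()
            for term in text.lower().split():
                self.postings[term].add(objectid)

    def search_all(self, query):
        """Return ids of objects containing every term of the query (AND)."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

Between `queue` and `run_indexer` a search returns nothing for the new object, which is the out-of-date window an index-backed user search has to live with.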

CLI

  • Use of a SYSTEM user for CLI operation/scripts? If not, ACL type of checks must only be done/enabled for Web requests

Misc