Anatomy of a request

From EPrints Documentation
Revision as of 10:23, 24 October 2014 by Libjlrs

THIS PAGE IS UNDER CONSTRUCTION! JLRS 2014-10-23

This is a description of how EPrints and Apache handle an incoming request. Understanding this flow helps you understand how an Access Control layer can be added to the system.

I will assume that you know how to locate a Perl module file from the module name (e.g. EPrints::Apache::Rewrite will probably be ~/perl_lib/EPrints/Apache/Rewrite.pm, although this is not always the case!).
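
As a reminder, the usual name-to-path convention can be sketched in a few lines of Perl. The helper name module_to_path is hypothetical, and the result is only the conventional location relative to a library directory such as ~/perl_lib, not a guarantee of where the file lives.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a module name into its conventional
# path relative to a library directory such as ~/perl_lib.
sub module_to_path
{
	my( $module ) = @_;

	# Replace each '::' with '/' in a copy of the name.
	( my $path = $module ) =~ s{::}{/}g;

	return "$path.pm";
}

print module_to_path( "EPrints::Apache::Rewrite" ), "\n";
```

For a module that is already loaded, Perl's %INC hash records the file it was actually loaded from, keyed by this relative path, which helps in the cases where the convention does not hold.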

Flow of a request

Below are the relevant parts of the config files and Perl modules that are used when processing a request. The request will generally be dealt with by the EPrints::Apache::Rewrite module, and farmed out from there. How the request reaches this module is also explained below.

Apache core config ~/cfg/apache.conf

PerlSwitches -I/home/eprints/eprints-3.3.12/perl_lib
PerlModule EPrints
PerlPostConfigHandler +EPrints::post_config_handler

The post_config_handler does some sanity checks on the EPrints setup (e.g. is Apache listening on the ports that the repositories are configured to use). See http://perl.apache.org/docs/2.0/user/handlers/server.html#C_PerlPostConfigHandler_ for more information about the PerlPostConfigHandler phase.

Apache repository config ~/cfg/apache/ARCHIVEID.conf

<VirtualHost *:80>
...
  PerlTransHandler +EPrints::Apache::Rewrite
 
</VirtualHost>

This leads us to the backbone of EPrints - the Rewrite module - where URL_REWRITE_* triggers are called, content negotiation can happen, and many other wondrous things take place!

EPrints::Apache::Rewrite module

This explanation and its line numbers are taken from a specific version of the file: https://github.com/eprints/eprints/blob/88a36fcf1f17c7a04e60455d374b617709f7461d/perl_lib/EPrints/Apache/Rewrite.pm. This file will change over time, so it's worth comparing it with the version you are using, and possibly with other versions on GitHub, e.g. https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Apache/Rewrite.pm

It's worth taking a few minutes to look at this file - specifically the sub handler.

EP_TRIGGER_URL_REWRITE

Line 123 calls the 'EP_TRIGGER_URL_REWRITE' trigger:

$repository->run_trigger( EPrints::Const::EP_TRIGGER_URL_REWRITE,
	request => $r,
	   lang => $lang,    # en
	   args => $args,    # "" or "?foo=bar"
	urlpath => $urlpath, # "" or "/subdir"
	cgipath => $cgipath, # /cgi or /subdir/cgi
	    uri => $uri,     # /foo/bar
	 secure => $secure,  # boolean
    return_code => \$rc,     # set to trigger a return
);

If any of the EP_TRIGGER_URL_REWRITE triggers sets return_code, that value is returned from the handler. See also: Information on triggers.
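
The return_code mechanism relies on passing a reference to $rc, so that a trigger can write a value the caller sees afterwards. The following is a minimal stand-alone sketch of that pattern only. It is not EPrints code: the @triggers list, the run_url_rewrite_triggers function, the /private path and the 403 value are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Stand-in for the list of registered trigger callbacks.
# NOT EPrints code; names and values are illustrative only.
my @triggers = (
	sub {
		my( %args ) = @_;

		# Ignore anything outside the (hypothetical) /private area.
		return if $args{uri} !~ m!^/private\b!;

		# Write through the reference so the caller sees the code.
		${ $args{return_code} } = 403;
		return;
	},
);

# Run every callback, then report whether one of them set a return code.
sub run_url_rewrite_triggers
{
	my( $uri ) = @_;

	my $rc;
	$_->( uri => $uri, return_code => \$rc ) for @triggers;

	return $rc; # undef means "carry on with normal processing"
}

for my $uri ( "/private/report", "/view/year" )
{
	my $rc = run_url_rewrite_triggers( $uri );
	print "$uri => ", ( defined $rc ? $rc : "continue" ), "\n";
}
```

In this sketch a trigger that leaves return_code untouched has no effect on the outcome, which mirrors the behaviour described above.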

CGI scripts

Line 157 deals with CGI scripts, redirecting to HTTPS if necessary. It looks for the CGI script in three locations (lines 195-199):

  • ~/archives/ARCHIVEID/cgi/
  • ~/site_lib/cgi/
  • ~/cgi/
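
A lookup of this kind can be sketched as a first-match search over a list of directories. This assumes that the directories are tried in the order listed above and that the first hit wins. The helper name find_cgi_script, the base directory and the script name "search" are placeholders, not taken from Rewrite.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: return the path of the first file named $script
# found under the given directories, or nothing if no directory has it.
sub find_cgi_script
{
	my( $script, @dirs ) = @_;

	for my $dir ( @dirs )
	{
		my $path = "$dir/$script";
		return $path if -f $path;
	}

	return;
}

# Illustrative paths only; ARCHIVEID and the base directory are placeholders.
my $base = "/home/eprints/eprints-3.3.12";
my @dirs = (
	"$base/archives/ARCHIVEID/cgi",
	"$base/site_lib/cgi",
	"$base/cgi",
);

my $found = find_cgi_script( "search", @dirs );
print defined $found ? "Found $found\n" : "No such CGI script\n";
```

Under that ordering assumption, a script placed in the archive's own cgi directory would take precedence over one with the same name in site_lib or the core cgi directory.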

If the CGI script is a 'user' script, it also defines a PerlAccessHandler (lines 214-220):

if( $uri =~ m! ^/users\b !x )
{
	$r->push_handlers(PerlAccessHandler => [
		\&EPrints::Apache::Auth::authen,
		\&EPrints::Apache::Auth::authz
	] );
}
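
To see which URIs that test catches, here is a small stand-alone check using the same pattern. It assumes $uri holds the path relative to the CGI root at this point. The function name is_user_script and the sample URIs are made up for the example.

```perl
use strict;
use warnings;

# Same pattern as above: /x ignores the spaces in the pattern, and \b
# requires a word boundary immediately after "users".
sub is_user_script
{
	my( $uri ) = @_;

	return $uri =~ m! ^/users\b !x ? 1 : 0;
}

for my $uri ( "/users/home", "/users", "/usersearch", "/search" )
{
	printf "%-12s %s\n", $uri, is_user_script( $uri ) ? "access handlers added" : "no access handlers";
}
```

The \b matters: /usersearch does not match, because there is no word boundary between "users" and the "e" that follows it.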

SWORD servicedocument

Lines 233-258 deal with the SWORD service document, via the CRUD interface.

REST interface

Lines 281-290 handle the REST interface, via EPrints::Apache::REST

EPrints URIs

Lines 292-374: EPrints URIs are normally of the form http://repository.blah/id/.... There are three main if blocks in this section that use regexes to match the URI:

  • Line 293 $uri =~ m! ^$urlpath/id/(repository|dump)$ !x matches exactly two URIs: $urlpath/id/repository and $urlpath/id/dump.
  • Line 318 $uri =~ m! ^$urlpath/id/([^\/]+)/(ext-.*)$ !x matches URIs such as 'event/ext-foo'. The purpose of this block is not yet documented here (possibly RDF-related).
  • Lines 345-347 (shown on one line here) $uri =~ s! ^$urlpath/id/(?: contents | ([^/]+)(?:/([^/]+)(?:/([^/]+))?)? )$ !!x matches a '/'-separated dataset, dataobjid and field - or 'contents'. The request is passed to the CRUD handler, and uses its authen/authz:
$r->push_handlers(PerlAccessHandler => [
	sub { $crud->authen },
	sub { $crud->authz },
] );
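
The captures produced by the pattern on lines 345-347 can be explored outside EPrints with a small wrapper. The function name parse_id_uri, the hash keys and the sample URIs are assumptions for illustration; only the pattern itself is taken from the text above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern from lines 345-347.
# Returns a hash ref of the captures, or nothing if $uri does not match.
sub parse_id_uri
{
	my( $uri, $urlpath ) = @_;
	$urlpath = "" if !defined $urlpath;

	return unless $uri =~ s! ^$urlpath/id/(?: contents | ([^/]+)(?:/([^/]+)(?:/([^/]+))?)? )$ !!x;

	# For the 'contents' alternative all three captures are undef.
	return { dataset => $1, dataobjid => $2, field => $3 };
}

for my $uri ( "/id/eprint/123/title", "/id/eprint/123", "/id/contents", "/view/year" )
{
	my $m = parse_id_uri( $uri );
	if( !defined $m )
	{
		print "$uri: no match\n";
		next;
	}

	my @shown = map { "$_=" . ( defined $m->{$_} ? $m->{$_} : "(none)" ) } qw( dataset dataobjid field );
	print "$uri: ", join( ", ", @shown ), "\n";
}
```

Note that a URI with more than three path segments after /id/ does not match, because the pattern is anchored at both ends.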

EPrint IDs, Documents

Under construction!

Lines 377-493: This block of code looks for requests starting with e.g. http://repository.blah/123 - where '123' is an EPrintID. There are some redirects in this block to account for older URLs that had zero-padded EPrintIDs and/or document positions, e.g. http://repository.blah/00000123 or http://repository.blah/00000123/01/Document.txt.

Each subsequent match on the $uri consumes part of it - e.g.

Line 377

$uri =~ s! ^$urlpath/(0*)([1-9][0-9]*)\b !!x

will remove the EPrintID from the start of $uri.

Line 398

$uri =~ s! ^/(0*)([1-9][0-9]*)\b !!x

will match elements after the EPrintID in the original URL - matching '45' in

  • http://repository.blah/123/45/Document.txt or
  • http://repository.blah/123/45.hassmallThumbnailVersion/Document.txt

(the second example shows the use of a 'relationship' that is processed using an EP_TRIGGER_DOC_URL_REWRITE trigger).
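
The way each substitution consumes part of $uri can be traced with a stand-alone sketch. The function name split_eprint_uri and the hash keys are hypothetical. The two patterns are the ones quoted above, and the redirect logic that Rewrite.pm applies to zero-padded URLs is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical helper that applies the substitutions from lines 377 and
# 398 in turn and records what each one consumed.
sub split_eprint_uri
{
	my( $uri, $urlpath ) = @_;
	$urlpath = "" if !defined $urlpath;

	my %parts;

	# Line 377: strip the (possibly zero-padded) EPrintID.
	return unless $uri =~ s! ^$urlpath/(0*)([1-9][0-9]*)\b !!x;
	@parts{qw( eprint_padding eprintid )} = ( $1, $2 );

	# Line 398: strip the (possibly zero-padded) document position.
	if( $uri =~ s! ^/(0*)([1-9][0-9]*)\b !!x )
	{
		@parts{qw( doc_padding docpos )} = ( $1, $2 );
	}

	$parts{rest} = $uri; # whatever is left for the later matches

	return \%parts;
}

for my $uri ( "/00000123/01/Document.txt", "/123/45.hassmallThumbnailVersion/Document.txt" )
{
	my $p = split_eprint_uri( $uri );
	print "$uri\n";
	print "  $_: '$p->{$_}'\n" for grep { defined $p->{$_} } qw( eprint_padding eprintid doc_padding docpos rest );
}
```

A non-empty padding capture is what identifies an old-style zero-padded URL. In the second example the remainder starts with '.hassmallThumbnailVersion', which is the part handled by the relationship code described next.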

Lines 418-419 may be a bit confusing at first glance.

$uri =~ s! ^([^/]*)/ !!x;
my @relations = grep { length($_) } split /\./, $1;

They deal with document relationships, which are of the form .../DocID.relationship1.relationship2.relationshipN/....

For documents, thumbnails are presented as related documents. The relationship is e.g. 'hassmallThumbnailVersion'.

The first line gets anything from the start of $uri (the DocID already having been removed by line 398), to the next '/'.

The second line (possibly the least readable line of code in EPrints?):

  • takes the captured match ($1): .relationship1.relationship2.relationshipN
  • splits on '.'s: "","relationship1","relationship2","relationshipN"
  • for each of the split values (referenced as $_): grep keeps the value only if its length is non-zero. This effectively strips out empty elements (length = 0 is false, so grep doesn't return the value).
  • @relations = "relationship1","relationship2","relationshipN"
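
The two lines can be tried in isolation. The wrapper below is a sketch: the function name extract_relations and the sample input are assumptions, and $uri stands for what remains after the document position has been removed.

```perl
use strict;
use warnings;

# Hypothetical wrapper around lines 418-419.
# Returns ( array ref of relations, remaining uri ), or an empty list
# if $uri contains no '/'.
sub extract_relations
{
	my( $uri ) = @_;

	# Take everything up to (and including) the next '/'.
	return unless $uri =~ s! ^([^/]*)/ !!x;

	# Split on '.' and drop the empty strings.
	my @relations = grep { length($_) } split /\./, $1;

	return ( \@relations, $uri );
}

my( $relations, $rest ) = extract_relations( ".hassmallThumbnailVersion/Document.txt" );
print "relations: @$relations; rest: $rest\n";
```

When no relationship is present, $uri starts with '/', the captured string is empty, and @relations ends up as an empty list.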

TODO (some should be separate pages)

  • Explain flow of Rewrite
    • triggers
    • cgi
    • content negotiation
    • CRUD
    • ???
  • permit on DataObj
    • can_request_view / can_user_view
  • summary pages (content neg/ URL rewrite)
  • DOI - 5 metadata elements
  • 40x handling
Access Control Layer