Simplified HTTPS Configuration

From EPrints Documentation

Latest revision as of 08:46, 20 May 2022


* * * YOU MUST USE EPRINTS 3.4.1+ FOR THE CONFIGURATION BELOW TO WORK RELIABLY * * *. For earlier versions of EPrints there is a more complex Legacy HTTPS Only Configuration.

Configuring EPrints for HTTPS can be difficult. Because of the way the code was previously written, even a correct HTTPS configuration could still leave you with mixed content pages, amongst other problems. In EPrints 3.4.1 the underlying code has been improved so that you can set host, port, securehost and secureport in your archive's cfg/cfg.d/10_core.pl in one of three ways, depending on the behaviour you want. Other configuration options in this file should not need to be changed.

Configurations

Make sure you remove or disable your archive's cfg/cfg.d/https.pl if it exists as it may override the configuration below. Once you have updated your configuration you must run generate_apacheconf to regenerate configuration for Apache before restarting the web server.

HTTP Only

It is advised you avoid using this configuration unless you are developing a repository on a non-publicly accessible web host.

$c->{host} = 'example.eprints.org';
$c->{port} = 80;
$c->{securehost} = undef;
$c->{secureport} = undef;

HTTPS When You Login

This is the current default for EPrints. All publicly accessible pages will use HTTP by default (but still be accessible over HTTPS if you modify the URL) and the login page and all login restricted pages will use HTTPS or be redirected from HTTP.

$c->{host} = 'example.eprints.org';
$c->{port} = 80;
$c->{securehost} = $c->{host};
$c->{secureport} = 443;

HTTPS Only

This ensures that no page or asset (image, CSS, JavaScript file, etc.) is served over HTTP. Any request made over HTTP is redirected to HTTPS.

You may also want to edit the archive's ssl/securevhost.conf to add the HSTS header.

$c->{host} = undef;
$c->{port} = 80;
$c->{securehost} = 'example.eprints.org';
$c->{secureport} = 443;

Issues and Troubleshooting

You may still encounter issues even if you use one of the configurations above, so it is advised that you first test the change on a development or pre-production instance of your repository to check you get the behaviour you expect.

Search Engine Indexing

It has been observed in the past that some items may briefly disappear from the Google search index when switching to HTTPS Only. There is no way to guarantee this will not happen. One way to mitigate and keep on top of this is to set up a Google Webmaster account and register your repository's hostname. After a couple of days this should be populated with all the pages indexed for your repository. If any are missing you can submit them to Google to be re-added.

IRStats2 Blip in Downloads

It has also been observed that repositories see a brief drop in downloads (and views) when switching to HTTPS Only. This may be partially due to search engine indexing, but the most likely cause is that bots and crawlers (including GoogleBot) will not follow redirects (i.e. from the HTTP URL they already had to the new HTTPS version), so their requests are not counted as downloads. IRStats2 has several ways of detecting bots, but it is likely that a large percentage of downloads will still be due to bots. In some ways, therefore, the blip may actually give a more accurate picture of the number of downloads from your repository. However, looking at a raw statistic is generally a bad idea: IRStats2 is intended to show usage trends and differences more than absolute downloads or views.

Bazaar Plugins

Effort has been made to resolve any issues that result from $c->{host} no longer being defined when you use HTTPS everywhere (with HTTPS redirects). However, this can only be done within the core codebase. Some Bazaar plugins may rely on $c->{host} being defined. Below are the Bazaar plugins that are known to have issues. This list is not exhaustive.
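As an illustration of how plugin code might cope with $c->{host} being undefined, the following is a minimal sketch and not code taken from EPrints or any Bazaar plugin. The helper name preferred_base_url and the stand-in $c hash are assumptions made for this example. In a real plugin the values would come from the archive configuration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of EPrints): build a base URL from the four
# settings in 10_core.pl, preferring HTTPS whenever securehost is defined.
sub preferred_base_url
{
    my( $c ) = @_;

    if( defined $c->{securehost} )
    {
        my $port = $c->{secureport} // 443;
        # Omit the port from the URL when it is the HTTPS default.
        return "https://" . $c->{securehost} . ( $port == 443 ? "" : ":$port" );
    }

    if( defined $c->{host} )
    {
        my $port = $c->{port} // 80;
        # Omit the port from the URL when it is the HTTP default.
        return "http://" . $c->{host} . ( $port == 80 ? "" : ":$port" );
    }

    die "Neither securehost nor host is defined\n";
}

# Stand-in for the archive configuration under HTTPS Only.
my $c = {
    host       => undef,
    port       => 80,
    securehost => 'example.eprints.org',
    secureport => 443,
};

print preferred_base_url( $c ), "\n";    # prints https://example.eprints.org
```

Checking securehost first means the same code keeps working under all three configurations described above, because it only falls back to host when no secure hostname is configured.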

Repository Links Plugin

The Repository Links plugin uses the EPrints URL to look up where it is used on other repositories. If your repository is configured as one of the remote repositories, then the master repository will have been storing your repository URLs as HTTP up to the point you switch over to HTTPS Only. When you do this it will start recording these URLs as HTTPS.

This creates two problems. $c->{host} will no longer be set, and this will prevent the related links box for Repository Links from being displayed both on your repository's abstract pages and on the abstract pages for items on the master repository. To fix this you will need to change $c->{host} to $c->{securehost} in /cgi/get_repo_links so that the hostname is set in the request sent to the master repository.

Also, the protocol itself needs to be changed from http to https in the text that precedes $c->{host} in the same file. This will ensure that it looks for https URLs, which will now be stored on the master repository. However, the master repository will still be storing http URLs for old items, so its administrators will need to update all the URLs for your repository to https URLs to make the Repository Links related links box work on your repository. The first change alone will fix the missing related links box on the master repository.
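How the master repository stores these URLs is not described here, so the following is only a sketch of the kind of rewrite involved, under the assumption that each stored value is a plain URL string. The helper name upgrade_url is hypothetical. It changes http to https only for URLs on the given hostname and leaves everything else untouched, including URLs that carry an explicit port.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite a stored URL to https, but only when it points
# at the given hostname with no explicit port. Other URLs are returned as-is.
sub upgrade_url
{
    my( $url, $host ) = @_;

    # \Q...\E quotes the dots in the hostname; the lookahead stops a match
    # against longer hostnames that merely start with $host.
    $url =~ s{^http://\Q$host\E(?=/|$)}{https://$host}i;

    return $url;
}

print upgrade_url( 'http://example.eprints.org/1234/', 'example.eprints.org' ), "\n";
# prints https://example.eprints.org/1234/
```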

It is also probably worth updating the protocol used for making HTTP requests to the master repository to use https as well, assuming it is configured to support HTTPS.

It is unclear whether any changes, and if so which, would be necessary if your repository is the master repository for the Repository Links plugin. However, as a minimum you should make the changes described above, check any other Repository Links plugin files for uses of $c->{host} and change them to $c->{securehost}, and ensure that https is used as the protocol wherever http is currently being used.

EPrint URI Change

When an EPrint is made live it will acquire a URI in the form

http://example.eprints.org/eprint/id/1234

If you switch over to HTTPS Only the above URI will be updated (if you refresh abstracts) to

https://example.eprints.org/eprint/id/1234

For most repositories this will not be an issue, but if your repository is harvested by a third party application, that application may rely on the URI as a unique identifier. If the URI changes, it may treat all the EPrints as new because none of the URIs are the same as before.
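If you maintain such a harvester yourself, one possible mitigation is to compare URIs without their scheme. The following is a minimal sketch of that idea and not part of EPrints. The helper name uri_key is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper for a harvester: reduce an EPrint URI to a key that
# ignores the scheme, so the http and https forms of a URI compare equal.
sub uri_key
{
    my( $uri ) = @_;

    # Copy the URI, then strip a leading http:// or https:// from the copy.
    ( my $key = $uri ) =~ s{^https?://}{}i;

    return $key;
}

my $before = 'http://example.eprints.org/eprint/id/1234';
my $after  = 'https://example.eprints.org/eprint/id/1234';

print uri_key( $before ) eq uri_key( $after ) ? "same record\n" : "different records\n";
# prints "same record"
```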

For third party applications that integrate through the Bazaar (EThoS, PIRUS, Symplectic Repository Tools, etc.) no problems relating to this have been identified. However, a bespoke third party application may be affected. You should test this beforehand if possible, or otherwise as soon as you go live with the new configuration.

If you need to ensure your EPrint URIs do not change you can add the uri_url configuration option at the end of your archive's 10_core.pl configuration as follows:

$c->{uri_url} = "http://" . $c->{securehost};

The OAI-PMH interfaces (e.g. http://example.eprints.org/cgi/oai2 and https://example.eprints.org/cgi/oai2) provide different relations (http or https) for a publication, but the OAI identifier is protocol independent and therefore stays the same. Third party applications that make use of OAI-PMH should therefore not be affected if they harvest as the protocol specifies.