Difference between revisions of "Simplified HTTPS Configuration"

From EPrints Documentation

Revision as of 13:55, 6 August 2019


* * * YOU MUST USE EPRINTS 3.4.1 OR LATER FOR THE CONFIGURATION BELOW TO BE GUARANTEED TO WORK * * *

Configuring EPrints for HTTPS can be difficult. Because of the way the code was previously written, even a correct HTTPS configuration could still leave you with mixed content pages, amongst other problems. In EPrints 3.4.1 the underlying code has been improved so that you can set host, port, securehost and secureport in your archive's cfg/cfg.d/10_core.pl in one of three ways to get the behaviour you want. Other configuration options in this file should not need to be changed.

Configurations

Make sure you remove or disable your archive's cfg/cfg.d/https.pl if it exists as it may override the configuration below. Once you have updated your configuration you must run generate_apacheconf to regenerate configuration for Apache before restarting the web server.

HTTP Only

It is advised you avoid using this configuration unless you are developing a repository on a non-publicly accessible web host.

$c->{host} = 'example.eprints.org';
$c->{port} = 80;
$c->{securehost} = undef;
$c->{secureport} = undef;

HTTPS When You Login

This is the current default for EPrints. All publicly accessible pages will use HTTP by default (but still be accessible over HTTPS if you modify the URL) and the login page and all login restricted pages will use HTTPS or be redirected from HTTP.

$c->{host} = 'example.eprints.org';
$c->{port} = 80;
$c->{securehost} = $c->{host};
$c->{secureport} = 443;

HTTPS Only

This ensures that no page or resource (image, CSS, JavaScript file, etc.) will be returned over HTTP. Anything requested over HTTP will be redirected to HTTPS.

You may also want to edit the archive's ssl/securevhost.conf to add the HSTS header.

$c->{host} = undef;
$c->{port} = 80;
$c->{securehost} = 'example.eprints.org';
$c->{secureport} = 443;
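
The three configurations above differ only in which of host and securehost are defined. As a rough illustration, the standalone Perl sketch below names the mode implied by a set of values. It is not EPrints code: the function name https_mode is hypothetical, and the hash only mimics the $c used in 10_core.pl.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of EPrints: names the mode implied by
# which of host / securehost are defined in a 10_core.pl-style hash.
sub https_mode
{
    my( $c ) = @_;

    my $has_http  = defined $c->{host};
    my $has_https = defined $c->{securehost};

    return 'HTTPS When You Login' if $has_http && $has_https;
    return 'HTTP Only'            if $has_http;
    return 'HTTPS Only'           if $has_https;
    return 'unconfigured';
}

# Same values as the HTTPS Only configuration above.
my $c = {
    host       => undef,
    port       => 80,
    securehost => 'example.eprints.org',
    secureport => 443,
};

print https_mode( $c ), "\n";
```

With the values shown, the sketch reports the HTTPS Only mode. Setting host as well would make it report HTTPS When You Login.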

Issues and Troubleshooting

You may still encounter issues even if you use one of the configurations above, so you are advised to test your chosen configuration on a development or pre-production instance of your repository to check you get the behaviour you expect.

EPrint URI Change

When an EPrint is made live it will acquire a URI of the form

http://example.eprints.org/eprint/id/1234

If you switch over to HTTPS Only the above URI will be updated (if you refresh abstracts) to

https://example.eprints.org/eprint/id/1234

For most repositories this will not be an issue. However, if your repository is harvested by a third party application, that application may rely on the URI as a unique identifier. If the URI changes, it may think that all the EPrints are new, as none of the URIs are the same as before.

For third party applications that integrate through the Bazaar (EThoS, PIRUS, Symplectic Repository Tools, etc.) no problems relating to this have been identified. However, if your repository has a bespoke third party application this may be affected. You should test this beforehand if possible, or otherwise as soon as you go live with the new configuration.

If you need to ensure your EPrint URIs do not change you can add the uri_url configuration option at the end of your archive's 10_core.pl configuration as follows:

$c->{uri_url} = "http://" . $c->{securehost};
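
To illustrate the effect of this setting, here is a minimal standalone sketch. It is not taken from the EPrints source: the function uri_base is hypothetical, and the fallback order (an explicit uri_url first, then the HTTPS host, then the HTTP host) is an assumption made for the example. Ports are ignored.

```perl
use strict;
use warnings;

# Hypothetical helper, not EPrints code: picks the base URL for EPrint URIs.
# Assumption: an explicit uri_url wins; otherwise the HTTPS host is
# preferred when defined, falling back to the HTTP host.
sub uri_base
{
    my( $c ) = @_;

    return $c->{uri_url} if defined $c->{uri_url};
    return 'https://' . $c->{securehost} if defined $c->{securehost};
    return 'http://' . $c->{host};
}

# HTTPS Only style settings, without uri_url.
my $c = { host => undef, securehost => 'example.eprints.org' };
print uri_base( $c ), "\n";    # base URL after switching to HTTPS Only

# Pin the base URL back to the http:// form, as in the line above.
$c->{uri_url} = "http://" . $c->{securehost};
print uri_base( $c ), "\n";    # base URL with uri_url set
```

Under these assumptions the first line printed uses the https:// scheme and the second uses http://, which is the form third party applications already hold.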

The OAI-PMH endpoints (e.g. http://example.eprints.org/cgi/oai2 and https://example.eprints.org/cgi/oai2) provide different relations (http or https) for a publication, but the OAI identifier is protocol independent and therefore stays the same. Third party applications that make use of OAI-PMH should therefore not be affected, provided they harvest as the protocol specifies.
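
As a small illustration of why the identifier is unaffected, the sketch below builds an identifier in the oai:namespace:local-id style described by the OAI identifier guidelines. The function sketch_oai_identifier is hypothetical, and using the hostname as the namespace is an assumption. This is not how EPrints itself is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: builds an identifier in the oai:<namespace>:<local id>
# style of the OAI identifier guidelines. Not the EPrints implementation.
sub sketch_oai_identifier
{
    my( $namespace, $eprintid ) = @_;

    return join( ':', 'oai', $namespace, $eprintid );
}

for my $endpoint ( 'http://example.eprints.org/cgi/oai2',
                   'https://example.eprints.org/cgi/oai2' )
{
    # The endpoint's scheme plays no part in the identifier.
    my $id = sketch_oai_identifier( 'example.eprints.org', 1234 );
    print $endpoint, ' -> ', $id, "\n";
}
```

Both endpoints yield the same identifier, because neither the scheme nor the port appears in it.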

Search Engine Indexing

It has been observed in the past that some items may briefly disappear from the Google search index when switching to HTTPS Only. There is no way to guarantee this will not happen. One way to try to mitigate and keep on top of this is to set up a Google Webmaster account and register your repository's hostname. After a couple of days this should be populated with all the pages indexed for your repository. If any are missing you can submit them to Google to be re-added.

IRStats2 Blip in Downloads

It has also been observed that repositories see a brief drop in downloads (and views) when switching to HTTPS Only. This may be partially due to search engine indexing, but the most likely cause is that bots and crawlers (including GoogleBot) will not follow redirects (i.e. from the HTTP URL they already had to the new HTTPS version), so their requests no longer count as downloads. IRStats2 has multifarious ways of detecting bots but it is likely a large percentage of downloads will still be due to bots. Therefore, in some ways the blip may actually give a more accurate picture of the number of downloads from your repository. However, looking at a raw statistic is generally a bad idea. IRStats2 is intended to show usage trends and differences more than absolute downloads or views.