CS244 2017: Is HTTPS still slow?


Reproducing “The Cost of the ‘S’ in HTTPS” originally by David Naylor et al., 2014

Reproduction by Benton Case and Zak Whittington, Spring 2017

Background

“The Cost of the ‘S’ in HTTPS,” published in December 2014, attempts to quantify the prevalence and added costs of HTTPS web traffic. At the time of writing, the conventional wisdom was that HTTPS was growing rapidly in popularity, and that HTTPS usage was imposing some cost on both clients and servers. Given that HTTPS was becoming so ubiquitous, the authors wanted to provide a comprehensive analysis of the specific types of costs it could induce. They took quantitative measurements where possible, and also included qualitative discussion of topics they were unable to directly measure. Their primary takeaway was that the extra latency imposed by HTTPS is non-negligible and noticeable to users, especially over 3G cell networks, while most other effects are negligible or ambiguous. Our results contradict the original finding: we found HTTPS had negligible effects on load time for the majority of websites, and was even faster than HTTP for almost a third of sites tested, suggesting the landscape of web encryption has changed dramatically in the last 2-3 years.

We chose to reproduce this research because HTTPS has become substantially more popular since this paper was written, and we would like to update their results. About 55% of traffic by volume is encrypted today, up from 38% in 2014. Today, about 75% of time spent browsing is on HTTPS sites according to Google’s Chrome statistics. Moreover, in the time since the paper was published, the characteristics of HTTPS traffic have changed: in 2014 almost all HTTPS traffic was still “small, privacy-sensitive objects,” whereas today HTTPS is increasingly seen as the default protocol for all types of traffic, including video streaming. In 2014, YouTube was just beginning to roll out HTTPS video streaming; today, essentially all YouTube traffic is encrypted (97%), Netflix, which makes up approximately a third of all web traffic by volume, is serving “most streams” over TLS, and many other streaming services are following suit [1]. Considering the new web traffic patterns, we expect that changes in certificate verification, improvements in network speed and latency, and modern browser implementations may have had an impact on the overhead costs of HTTPS.

The paper considered many forms of potential cost, only some of which we will seek to reproduce. Broadly, they considered usage trends, webpage load time, TLS handshake data overhead, and cell phone battery usage. The authors collected per-flow logs from a vantage point monitoring 25,000 customers of a major European ISP for two years from 2012 to 2014, which they used to create their usage trends and TLS handshake overhead analysis. We are unable to reproduce these results due to lack of access to data of this scope, but we also feel it is unnecessary to update these results. Since encryption has become a topic of interest to the general public, several large tech companies including Google and Mozilla have begun publishing public data on HTTPS usage, likely with a higher degree of accuracy than anything we, or even the original authors, could muster.

Instead, we will focus on reproducing what we consider to be the most interesting quantitative result from their paper: added latency over mobile and fiber networks. The original paper found that a significant portion of sites were noticeably (>0.5 seconds) slower over HTTPS. As the paper argued, in a world where users expect sites to load in under two seconds [4] and a one second delay could cost a company billions [3], delays of this magnitude are non-negligible. The original authors used a headless browser called PhantomJS to measure the page load time of the Alexa Top 500 sites over HTTP and HTTPS, averaged over 20 loads each, repeated over 3G and fiber connections. They presented this data in a cumulative distribution function (see Figure 5 below), demonstrating that for about 90% of websites over 3G, and for 40% of websites over fiber, extra latency was more than 500ms. This methodology is simple but compelling, suggesting that HTTPS is noticeably slower than HTTP to the end user, especially over 3G.

Reproduction Methodology

We also used PhantomJS, and ran our tests over fiber and 4G networks. The fiber tests were conducted from a computer running on the Stanford network, and the mobile tests were run from a MacBook tethered to an iPhone connected over a T-Mobile 4G LTE network. We also revived the original authors’ code and confirmed that their code generated similar results.

Input Data: Like the original paper, we used the Alexa Top 500 sites as our initial sample.

Filtering: Though the original paper did not explicitly mention this, according to email communications with author David Naylor, they only tested on sites that served responses over both HTTP and HTTPS. We filtered similarly, dramatically reducing the number of eligible sites. If a site ever redirected an HTTP request to an HTTPS URL, failed to return a 200 OK response to an HTTPS request, or failed to respond at all, we removed it from our sample. We suspect that some sites failed to respond because they detected we were using PhantomJS instead of a normal browser.
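The three removal rules above can be sketched as a small predicate. This is an illustration in Python, not our PhantomJS script: the `Probe` record and the `keep_site` function are hypothetical names, and treating a missing response on either request as "failed to respond at all" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    """Outcome of one page request (hypothetical record, for illustration only)."""
    status: Optional[int]        # final HTTP status code, or None if nothing came back
    final_scheme: Optional[str]  # scheme of the URL that finally served the page


def keep_site(http: Probe, https: Probe) -> bool:
    """Return True if the site stays in the sample under the three filtering rules."""
    # Rule: the site failed to respond at all (assumed to apply to either request).
    if http.status is None or https.status is None:
        return False
    # Rule: a request that started as HTTP was redirected to an HTTPS URL.
    if http.final_scheme == "https":
        return False
    # Rule: the HTTPS request did not come back with 200 OK.
    if https.status != 200:
        return False
    return True
```

A site is timed only when `keep_site` returns True, which corresponds to the sites that serve both HTTP and HTTPS.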

Timing: Our timing script was written from scratch. For each website it sends 4 synchronous, consecutive requests, first over HTTP and then over HTTPS, and measures the time from sending the initial request until the page (including all its resources) is completely loaded. Every measurement is recorded in a measurements file, though only some of them are used to generate the plots.
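The timing loop can be sketched as follows. This is a Python illustration rather than the PhantomJS script itself: `load_page` is a hypothetical callable that blocks until the page and all its resources have loaded, and we assume here that the 4 HTTP loads run before the 4 HTTPS loads. The `summarize` helper computes the mean and median ratio and difference that name our output plots; defining the ratio as HTTPS time over HTTP time is also an assumption.

```python
import time
from statistics import mean, median
from typing import Callable, Dict, List


def time_site(load_page: Callable[[str], None], host: str,
              runs: int = 4) -> Dict[str, List[float]]:
    """Time `runs` consecutive loads over HTTP, then over HTTPS, in seconds."""
    timings: Dict[str, List[float]] = {"http": [], "https": []}
    for scheme in ("http", "https"):
        for _ in range(runs):
            start = time.perf_counter()
            load_page(f"{scheme}://{host}/")  # must block until the load completes
            timings[scheme].append(time.perf_counter() - start)
    return timings


def summarize(timings: Dict[str, List[float]]) -> Dict[str, float]:
    """Reduce raw timings to the per-site ratio and difference statistics."""
    http_mean, https_mean = mean(timings["http"]), mean(timings["https"])
    http_med, https_med = median(timings["http"]), median(timings["https"])
    return {
        "ratio_mean": https_mean / http_mean,
        "ratio_median": https_med / http_med,
        "difference_mean": https_mean - http_mean,
        "difference_median": https_med - http_med,
    }
```

Passing the loader in as a callable keeps the timing logic independent of the browser driving the page load.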

Plotting: Our plotting script was written from scratch, runs in Python on the output generated by the timing script, and uses matplotlib to generate CDF plots.
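For readers unfamiliar with CDF plots, the computation is short. The sketch below is not our actual plotting script; `empirical_cdf` and `plot_cdf` are hypothetical names, and the step-style rendering is one reasonable choice among several.

```python
from typing import List, Sequence, Tuple


def empirical_cdf(values: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return sorted values and, for each, the fraction of samples at or below it."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


def plot_cdf(values: Sequence[float], xlabel: str, out_path: str) -> None:
    """Draw the empirical CDF as a step plot and save it to `out_path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    xs, ys = empirical_cdf(values)
    fig, ax = plt.subplots()
    ax.step(xs, ys, where="post")
    ax.set_xlabel(xlabel)
    ax.set_ylabel("CDF")
    fig.savefig(out_path)
    plt.close(fig)
```

Each point (x, y) on the curve says that a fraction y of the sites had a value of x or less, so a curve that rises steeply near a ratio of 1 means most sites load about equally fast over both protocols.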

Mobile: Our script for mobile was nearly identical to the script for fiber, but also sent an HTTP user agent header mimicking a Galaxy Nexus phone running the mobile version of Chrome, and faked the display size of a Galaxy Nexus. We also tested over 4G rather than 3G, since 4G is now used by over 85% of Americans [6].

Results

Protocol Compatibility: While the original paper talked about the prevalence of HTTPS browsing, it used a different dataset with different metrics to do so; it did not discuss the prevalence of HTTP(S) websites within the Alexa Top 500 Sites. Our script filtered to only time sites that served both HTTPS and HTTP without HTTPS redirects, resulting in a total of 182 viable sites. We expect some sites may actually have served either HTTP or HTTPS, but chose not to respond to requests that originated from a headless Phantom browser.

Table 1: Breakdown by Category of Alexa Top 500 Sites’ Available Protocols
Category                   Number of Sites
Both HTTP and HTTPS        182
Only HTTP                  97
Only Redirects to HTTPS    194
Neither HTTP nor HTTPS     27
Total                      500
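As a quick check on Table 1, the counts can be turned into shares of the 500-site sample. This is a small worked example in Python using only the numbers from the table; the variable names are ours.

```python
# Counts copied from Table 1.
counts = {
    "Both HTTP and HTTPS": 182,
    "Only HTTP": 97,
    "Only Redirects to HTTPS": 194,
    "Neither HTTP nor HTTPS": 27,
}

total = sum(counts.values())

# Share of the sample in each category, as a percentage rounded to one decimal.
shares = {category: round(100 * n / total, 1) for category, n in counts.items()}

for category, share in shares.items():
    print(f"{category}: {counts[category]} sites ({share}%)")
```

The four categories sum to the full sample of 500, and the sites that only redirect to HTTPS number exactly twice the sites that only serve HTTP.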

Fiber: Our results over fiber differ significantly from the original paper’s results. We expect these differences are due to the changing landscape of the internet as opposed to deficiencies in methodology. It is clear that the HTTPS landscape has changed dramatically. HTTPS has exploded in popularity: twice as many sites serve only HTTPS as serve only HTTP. In many cases, HTTPS is actually faster than HTTP these days, which was so rare in 2014 that it wasn’t discussed as a possibility in the original paper. Our CDFs are presented below, along with some salient takeaways.

Mobile: These results differ from the original paper’s results to an even greater degree, in that HTTPS often saved a significant amount of download time compared to HTTP. For about 30% of sites, load times were noticeably faster over HTTPS than over HTTP, and there was no significant difference on most of the remaining sites.

Over both fiber and mobile, we found that for the vast bulk of sites, HTTPS adds no distinguishable latency over HTTP, and is even faster than HTTP in many cases, directly contradicting the findings of the original paper.

Original Results

[Figure: original-results.PNG]

Our Reproduced Results

[Figure: scratch-ratio]

[Figure: scratch-difference]

[Figure: scratch-ratio-mobile]

[Figure: scratch-difference-mobile.png]

Takeaways

  • The original paper’s conclusion is that HTTPS introduces non-negligible latency; our findings contradict that claim
  • Only roughly 15% of sites are noticeably (>500ms or 1.3x) slower over HTTPS today, versus 40% in the original paper
  • ~80% of sites saw no difference between HTTPS and HTTP load times today, versus 55% in the original paper  
  • HTTPS is actually faster than HTTP for about 30% of sites today, versus a negligible (<3%) number of sites in 2014
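The "noticeably slower" threshold in the takeaways (more than 500ms or 1.3x) can be written as a small classifier. This is a sketch in Python, not code from our repository: `classify` is a hypothetical name, times are in seconds, and the mirrored threshold used for "faster" is an assumption, since the text above only states the threshold for "slower".

```python
from collections import Counter
from typing import Iterable, Tuple


def classify(http_s: float, https_s: float,
             abs_threshold_s: float = 0.5,
             ratio_threshold: float = 1.3) -> str:
    """Label one site by comparing its HTTPS load time with its HTTP load time."""
    difference = https_s - http_s
    ratio = https_s / http_s  # assumes http_s > 0
    if difference > abs_threshold_s or ratio > ratio_threshold:
        return "slower"
    # Assumption: "faster" mirrors the "slower" thresholds.
    if difference < -abs_threshold_s or ratio < 1 / ratio_threshold:
        return "faster"
    return "no difference"


def tally(pairs: Iterable[Tuple[float, float]]) -> Counter:
    """Count how many (http_s, https_s) pairs fall in each label."""
    return Counter(classify(http_s, https_s) for http_s, https_s in pairs)
```

Dividing each count from `tally` by the number of sites gives percentages comparable to those in the list above.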

Reproduction Instructions

The simplest version of our reproduction scripts is designed to run seamlessly on the CS244 VM. The scripts can also be run relatively easily on any machine; see the git readme file for more details.

Easy Reproduction

Step 1: Download the CS244 Ubuntu VM

Step 2: From within the VM, clone our public GitHub repo

git clone https://github.com/Zak244/244-HTTPS.git https-cost

Step 3: Run Setup and Timing Scripts

cd https-cost
./setup
./run_tests

The output should be four plots: ratio-mean.png, ratio-median.png, difference-mean.png, and difference-median.png. The scripts will rely on the Alexa data gathered on May 30, 2017 by default.

If interested, more options for reproduction are detailed in the git readme file.

Final Thoughts

We had to limit the scope of our research due to resource constraints; we lacked access to many of the datasets and technologies used by the original team. However, one of the core findings of the paper was that HTTPS introduced non-negligible latency compared to HTTP, and our results differed markedly from that finding. We encountered no significant challenges reproducing this section of the work, and we have several ideas for possible future extensions.

We interpreted the original paper as making broad claims about what the average internet user could expect from HTTPS vs HTTP browsing. However, one downside to their research is that it uses an obscure headless browser called PhantomJS, which is detectable by servers, and may thus be getting unrealistic results. This is particularly problematic for mobile sites. The original paper did not discuss any attempts to make the PhantomJS browser appear to be a mobile browser; they seem to have run the same tests using a 3G USB modem from their laptop. Today, many websites use browser detection heuristics to serve specialized mobile content, so we tried to make our Phantom browser look like a mobile browser using all the (somewhat limited) flexibility available to us in the Phantom interface. We suggest an extension that runs similar tests using a headless browser that more realistically emulates popular browsers like Chrome and Firefox, in both desktop and mobile environments.

Citations

[1] https://www.theverge.com/2016/8/1/12341686/youtube-google-traffic-https-encryption-protected
[2] https://www.google.com/transparencyreport/https/metrics/?hl=en
[3] https://www.fastcompany.com/1825005/how-one-secondcould-cost-amazon-16-billion-sales
[4] https://www.oneupweb.com/blog/need-speed-truth-page-load-time/
[5] https://www.w3counter.com/globalstats.php
[6] https://opensignal.com/reports/2017/02/usa/state-of-the-mobile-network

One response to “CS244 2017: Is HTTPS still slow?”

  1. Reproducibility: 4/5

    I had a couple of problems trying to reproduce the results while following the instructions (see below) but I was able to reproduce similar looking graphs after some tweaking. The graphs output by createPlot.py have the same general trend but all of the points within the plot are blue. I suspect this may have been a matplotlib version mismatch though I used the provided virtual machine and followed the setup instructions. In any case the experiment revealed some interesting results and the blog post demonstrates understanding of the original paper.

    As an aside, I would have liked to see more in terms of analysis on the result that some sites load faster over HTTPS considering the additional RTTs required for the TLS handshake. Do you know whether phantomjs uses HTTP/2 and multiplexing?

    Thanks,
    Sam

    ———————————
    Missing instructions
    • Password for the VM is CS244
    • Setup needs to be run as sudo or it will throw an exception
    • ./run_tests as listed in the blog post runs the original authors’ scripts (according to https://github.com/Zak244/244-HTTPS). One instead needs to read the ‘Alternate Instructions’ in order to reproduce “the measurements we published in our blog post”
    • The alternate instructions are vague and don’t tell me how to configure the network. I assumed that the timing test needs to run once over fibre (with timePages.js) and once over mobile (with timePages-mobile.js) but this isn’t explicitly mentioned. There is no plotting script for mobile (fibre is hardcoded into the title) so I just used createPlot.py again.
