# CS244 ’17: Mahimahi: Accurate Record-and-Replay for HTTP

Jiaxin Guan (jxguan@stanford.edu)
Wen Zhang (zhangwen@stanford.edu)

# Introduction

This paper presents an HTTP record-and-replay tool called Mahimahi. It records HTTP traffic between a client and a server and can later replay it under simulated network conditions (e.g., an imposed maximum throughput and minimum RTT).

According to the paper, Mahimahi has three novel aspects compared to other HTTP record-and-replay frameworks:

1. Accuracy: When replaying HTTP traffic, Mahimahi faithfully replicates the multi-server nature of Web applications by spawning a separate Apache server for each server contacted during recording. This enables accurate measurements of page load times.
2. Isolation: Mahimahi isolates its traffic from the rest of the host system using separate network namespaces. This eliminates traffic interference and enables reproducible performance measurements with low overhead.
3. Composability: Mahimahi is split into composable shells, which makes it easy to use and extend. For example, one can easily launch Google Chrome inside a DelayShell inside a ReplayShell to replay HTTP traffic with an imposed RTT.

The authors designed experiments and described use cases to substantiate each of these points.

We think that Mahimahi is exciting research because:

• A reliable and easy-to-use HTTP record-and-replay tool is instrumental to, e.g., rapid iteration in the development of new transport protocols such as SPDY and QUIC.
• Mahimahi emphasizes usability, which is essential for adoption by the greater networking community beyond research. As discussed below, we found it much easier to use than web-page-replay, another HTTP record-and-replay tool.

# Reproduction

## Subset goal

The paper claims that ReplayShell, which is in charge of replaying HTTP traffic, achieves higher performance-measurement accuracy by emulating the multi-server nature of Web applications. The emulation is done by launching, inside a separate network namespace, one Apache server for each server encountered during recording, so that at replay time, requests for different hosts get routed to different Apache servers.
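As a toy illustration of this mechanism (our own sketch in Python with invented hostnames and addresses, not Mahimahi’s actual code, which is C++):

```python
# Toy illustration of multi-server replay: a fake DNS server resolves each
# hostname seen at record time to the address of that host's own dedicated
# Apache instance inside the namespace. Hostnames/addresses are invented.
recorded_hosts = ["www.example.com", "cdn.example.net", "ads.example.org"]

# One distinct in-namespace address per recorded server.
host_to_addr = {host: f"10.0.0.{i + 2}" for i, host in enumerate(recorded_hosts)}

def fake_dns_resolve(hostname):
    """Every recorded hostname gets its own server; unknown hosts fail."""
    return host_to_addr[hostname]
```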

To substantiate this claim, the authors measured the page load time (defined as the time elapsed between the navigationStart and loadEventEnd events) for 20 Web pages from the Alexa US Top 500 while imposing a minimum RTT of 100 ms and a maximum throughput of 5 Mbits/s. They then did the same measurement on replayed traffic using the original Mahimahi ReplayShell, a modified ReplayShell that serves all resources from a single server, and Google’s web-page-replay, under the same emulated network conditions.
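The metric itself is trivial to compute once the browser’s Navigation Timing values are in hand (a sketch of ours; the timing dict would come from, e.g., Selenium evaluating `window.performance.timing` in the loaded page):

```python
# Page load time as defined in the paper: navigationStart -> loadEventEnd.
# The `timing` dict stands in for the browser's Navigation Timing object,
# whose values are millisecond timestamps.
def page_load_time_ms(timing):
    """Return the page load time in milliseconds."""
    return timing["loadEventEnd"] - timing["navigationStart"]
```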

They found that the median measurement error achieved by the multi-server ReplayShell (12.4%) is lower than that achieved by the single-server ReplayShell (20.5%), which is in turn lower than that of web-page-replay (36.7%). [Here, “error” is defined as the absolute value of the percent difference between mean page load times (over 25 runs) within an emulation environment and on the Internet.] The errors are depicted in Figure 3 of the paper.

We seek to reproduce this graph detailing the measurement accuracy of the three different tools/approaches.

## Subset motivation

We were interested in verifying the utility of Mahimahi’s novel features, of which there are three: multi-server emulation, isolation, and composability and extensibility.

• Reproducing the composability and extensibility aspect would involve reporting on our experience of using Mahimahi to evaluate transport protocols; this feels a bit subjective for the purposes of this project.
• Reproducing the isolation aspect involves evaluating reproducibility of measurements (i.e., running multiple measurements on different machines) and the shells’ overhead (i.e., measuring performance inside no-op DelayShell and LinkShell). We thought these might be too trivial for the purposes of this project.
• Reproducing the multi-server emulation aspect directly reveals the effect of spawning multiple servers and also involves comparing with another tool. We find this aspect the most interesting to work on.

## What we did

At a high level, we performed these steps based on Section 4.1 of the paper:
1. Pick 20 websites from Alexa US Top 500 sites as the corpus.
2. For each site $i$ ($1\leq i\leq 20$):
1. Repeat 25 times: Inside a LinkShell with 5 Mbits/s throughput (separate up & down) inside a DelayShell with 100 ms one-way delay, measure page load time using Chrome driven by Selenium. Compute average $\overline{U}_i^{\text{M}}$.
2. Repeat 25 times: Using TrafficShaper (modified version; see paragraph below) from web-page-replay with 5 Mbit/s bandwidth (separate in & out) and 200 ms RTT delay, measure page load time. Compute average $\overline{U}_i^{\text{WPR}}$.
3. Launch a RecordShell to record the HTTP traffic from Chrome visiting the same URL.
4. Use web-page-replay to record the same traffic.
5. Repeat 25 times: Inside a LinkShell and a DelayShell (as specified above) inside a ReplayShell (of the recorded traffic), measure page load time. Compute average $\overline{T}_i^{\text{MM}}$.
6. Repeat 25 times: Same as the previous step, except that we measure under a modified ReplayShell that uses only one Apache server. Compute average $\overline{T}_i^{\text{MS}}$.
7. Repeat 25 times: Use web-page-replay to replay the recorded traffic, simulating 5 Mbit/s bandwidth and 200 ms RTT delay, and measure page load time. Compute average $\overline{T}_i^{\text{WPR}}$.
8. Compute absolute percentage error for each type of measurement:
1. For the multi-server ReplayShell: $E_i^{\text{MM}}=\left|\overline{T}_i^{\text{MM}}-\overline{U}_i^{\text{M}}\right|/\overline{U}_i^{\text{M}}\times 100$.
2. For the single-server ReplayShell: $E_i^{\text{MS}}=\left|\overline{T}_i^{\text{MS}}-\overline{U}_i^{\text{M}}\right|/\overline{U}_i^{\text{M}}\times 100$.
3. For web-page-replay: $E_i^{\text{WPR}}=\left|\overline{T}_i^{\text{WPR}}-\overline{U}_i^{\text{WPR}}\right|/\overline{U}_i^{\text{WPR}}\times 100$.
3. Plot the empirical CDF for $E_i^{\text{MM}}$, $E_i^{\text{MS}}$, and $E_i^{\text{WPR}}$ $(1\leq i\leq 20)$.
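The per-site error and the empirical CDF can be sketched in a few lines of Python (our own sketch mirroring the formulas above, not our actual scripts):

```python
# Per-site absolute percentage error and the empirical CDF over all sites,
# following the E_i formulas above. Times are page load times.
def mean(xs):
    return sum(xs) / len(xs)

def abs_pct_error(replay_times, baseline_times):
    """E_i = |mean(T_i) - mean(U_i)| / mean(U_i) * 100."""
    t_bar, u_bar = mean(replay_times), mean(baseline_times)
    return abs(t_bar - u_bar) / u_bar * 100

def empirical_cdf(errors):
    """Return (x, y) points of the empirical CDF over per-site errors."""
    xs = sorted(errors)
    return [(x, (i + 1) / len(xs)) for i, x in enumerate(xs)]
```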
To perform these steps, we modified/enhanced the source code of Mahimahi and web-page-replay mainly in two places:
1. We modified the Mahimahi ReplayShell to support single-server replay. Specifically, in single-server mode (enabled by the --single-server command line flag):
• ReplayShell spawns only one Apache server inside the network namespace.
• The Apache server is configured to listen on a single, arbitrarily picked external IP address, for which a network interface has been created inside the namespace. No other network interfaces for external IP addresses are created.
• The server listens on all ports on which RecordShell has previously seen traffic, serves HTTP on all of them, but only serves HTTPS on 443 (as does the original version).
• The fake DNS server, used during replay, resolves all host names seen during recording to that specific external IP address.
2. We created a custom version of web-page-replay’s TrafficShaper to perform direct measurement under network conditions simulated by ipfw:
• We think that web-page-replay by itself only supports traffic shaping during replay, while we also need it for direct measurement.
• In the original TrafficShaper, it seems that traffic shaping is only applied to traffic to/from the local proxy server, which performs the recording.
• In our version, traffic shaping is applied to all traffic that doesn’t go through the loopback IP address 127.0.0.1. We excluded port 22 so that SSH connections don’t get affected.
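In essence, our modified TrafficShaper issues ipfw commands along these lines (a rough sketch with invented rule numbers, not the exact code; web-page-replay itself is written in Python and shells out to ipfw):

```python
# Rough sketch of the ipfw/dummynet rules our modified TrafficShaper
# installs: shape all traffic except loopback and SSH (port 22).
# Rule numbers and exact syntax are illustrative.
UP_KBPS = DOWN_KBPS = 5000   # 5 Mbit/s each way
ONE_WAY_DELAY_MS = 100       # 100 ms per direction => 200 ms RTT

IPFW_COMMANDS = [
    "ipfw add 100 allow ip from 127.0.0.1 to 127.0.0.1",  # leave loopback alone
    "ipfw add 110 allow tcp from any to any 22",          # leave SSH alone
    f"ipfw pipe 1 config bw {DOWN_KBPS}Kbit/s delay {ONE_WAY_DELAY_MS}ms",
    f"ipfw pipe 2 config bw {UP_KBPS}Kbit/s delay {ONE_WAY_DELAY_MS}ms",
    "ipfw add 200 pipe 1 ip from any to any in",          # inbound traffic
    "ipfw add 210 pipe 2 ip from any to any out",         # outbound traffic
]
```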
Here’s our experiment environment:
• OS: Debian GNU/Linux 8 (jessie) image from Google Cloud Platform, with its kernel upgraded to version 3.18.44 (see the “Challenges” section).
• This environment differs from what the original authors used, Ubuntu 13.10. We discuss our choice of OS in the “Challenges” section.
• VM instance: 2 vCPUs, 7.5 GB memory, us-east1-c zone. This is to approximate the original authors’ “Amazon EC2 m3.large instance located in the US east-1a region”.
• Disk: 20 GB standard persistent disk.

## Subset result

Here’s the graph we generated:

This graph indicates, as does the original graph, that the multi-server measurements had lower measurement error than the single-server measurements did.

What is inconsistent with the original graph, however, is the measurement errors for web-page-replay: our web-page-replay measurement errors were lower than those from the original paper. A couple of comments on this point:

• The original authors claim that they “were not certain why single-server ReplayShell is so much more accurate than web-page-replay.”
• Even though we configured Mahimahi and our custom TrafficShaper to simulate the same network conditions, our web-page-replay baseline measurements were in general much higher than our Mahimahi baseline measurements (by a median of 79%).
• We noticed that web-page-replay hard-codes a finite queue size in its traffic shaping mechanism. Maybe this caused the increased load times. We didn’t have time to verify this hypothesis.
• The discrepancy might also be due to the different environment that we ran on.
• Of course, it’s possible that we entered the wrong parameters for either tool (or had some other bug), although we did manually verify the latency (with ping) and throughput (by downloading a large file) under each environment. Furthermore, web-page-replay was run unmodified for the replay measurements.

## Challenges

### web-page-replay

We experienced tremendous difficulty in getting web-page-replay to work.

Initially, we were unable to get web-page-replay to properly record HTTP traffic. When we ran it in record mode, we got a bunch of “Number of active connections surpasses the supported limit of 500” errors. When we force-killed the process, Wen’s laptop was no longer able to access the Internet. (Poor Wen! This was fixed by restoring the DNS resolver configuration. No such thing ever happened with Mahimahi!) We later discovered that the issue was with web-page-replay’s DNS forwarding module. We were able to get web-page-replay to work by setting the --no-dns_forwarding flag and manually configuring Chrome to forward all traffic to localhost.
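Concretely, the Chrome side of this workaround amounts to passing switches like the following via Selenium (illustrative, not verbatim; the port numbers are our placeholders):

```python
# Illustrative Chrome switches for forwarding all browser traffic to
# web-page-replay's local proxy. Port numbers are placeholders.
HTTP_PORT, HTTPS_PORT = 8080, 8443

CHROME_FLAGS = [
    f"--proxy-server=http=127.0.0.1:{HTTP_PORT};https=127.0.0.1:{HTTPS_PORT}",
    "--ignore-certificate-errors",  # replayed HTTPS uses self-signed certs
]
```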

With the steps taken above, we were able to use web-page-replay to record and replay HTTP traffic as desired. However, there were issues with network simulation (latency, up/down link speed). web-page-replay relies on the ipfw functionality of dummynet for network emulation, which no longer works on Linux kernel versions 4 and above. We explored these alternatives:

• We tried running the experiments on Ubuntu 13.10 (as the original authors did). After hunting down a disk image, we had difficulty installing the prerequisites for Mahimahi on the OS (we had to compile two packages from source), and then we encountered a compiler bug while compiling Mahimahi. We considered upgrading g++, but were worried that we might encounter even more problems down the line because, after all, Ubuntu 13.10 is no longer supported. We were also unable to migrate the VM onto Google Cloud (the sign-in popup just disappears), so we gave up.
• The web-page-replay project suggests using tsproxy for traffic shaping in place of ipfw. However, tsproxy made web page loading absurdly slow with Selenium and Chrome, even with a very low latency and very high link speeds (Github issues here and here).
• netem can also be used to shape traffic, but we found it a bit difficult to control the down link speed with it.
• Eventually, we started with a Debian 8 VM instance, compiled and upgraded the Linux kernel to version 3.18.44 (after identifying and enabling the IP-filter-related modules that Mahimahi requires), compiled dummynet against the new kernel source, installed the dummynet kernel module, and were finally able to run web-page-replay with its network emulation options. (We found these instructions in Chinese super helpful.)
### Running experiments

Another challenge was that the entire experiment took a very long time to run:

• For each of the 20 sites, we need to perform two recordings and 25 * 5 = 125 measurements.
• Each measurement averaged ~16 seconds, so a conservative run time estimate is 25 * 5 * 20 * 16 = 11 hours (not counting setup/teardown and recording time).
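The back-of-the-envelope estimate works out as follows:

```python
# Run-time estimate: 25 runs x 5 measurement conditions x 20 sites,
# at ~16 seconds per measurement (setup/teardown and recording excluded).
runs_per_condition = 25
conditions = 5           # MM baseline, WPR baseline, MM/MS/WPR replay
sites = 20
seconds_per_measurement = 16

total_hours = (runs_per_condition * conditions * sites
               * seconds_per_measurement) / 3600
# total_hours is about 11.1
```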

We’d like to decrease the run time so that we can run multiple experiments and iterate rapidly.

To this end, we launched 12 Google Cloud VM instances in one region (exhausting our 24-CPU quota), assigned each website to an instance based on the run time estimate of each website, and ran the experiment in parallel semi-automatically (using a combination of Ansible and bash scripts). This allowed us to, for example, run two sets of experiments in under five hours.

## Critique

We find the paper quite clear on the experiment setup and steps. The only point we’re not super sure about is how they measured the errors for web-page-replay:

• The text of Section 4.1 seems to suggest that the baseline load times were measured only within Mahimahi’s DelayShell and LinkShell.
• If we took this route, we’d no longer feel justified in using web-page-replay’s own network shaping tool. If web-page-replay’s replay measurements were compared with a baseline measured with Mahimahi, then web-page-replay’s error could be attributed to the discrepancy in network condition emulation between the two tools. So we’d need to use Mahimahi to emulate network conditions while measuring performance using web-page-replay’s replay mode.
• We’re not sure if the authors did this. Nor do we find it possible without modifying Mahimahi: web-page-replay, in replay mode, seems to resolve all host names to 127.0.0.1; Mahimahi’s network namespace, however, has its own loopback interface, which means that traffic in there won’t reach web-page-replay’s replay server on the outside.

This is why we did two sets of baseline measurements, one using Mahimahi’s shells, the other using a minimally-modified version of web-page-replay’s TrafficShaper based on ipfw, as described above.

## Extensions

When we were recording traffic with Mahimahi’s RecordShell, two types of error messages caught our attention:

• Apparently, the HTTP OPTIONS method is not supported.
• We saw SSL-related error messages along the lines of “SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure”.

We tried to fix these errors:

• We imitated the code for other HTTP methods and added support for OPTIONS.
• We discovered the cause of the SSL error: the ReplayShell man-in-the-middles HTTPS connections as a proxy; the proxy accepts a connection from the client and makes the same connection to the destination; when the client sends a hostname using the SNI extension, the proxy fails to send the same hostname to the server; since some servers require SNI, the connection may consequently fail. We wrote a fix for this; we used a coarse mutex, but it shouldn’t impose too much overhead for our experiments.
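The essence of the SNI fix can be sketched like this (a toy sketch of ours in Python; Mahimahi itself is C++, and the function name is our invention):

```python
# Toy sketch of the SNI fix: when the proxy dials the real server, it must
# present the client's SNI hostname on its own upstream TLS handshake,
# since some servers refuse connections without SNI.
import socket
import ssl

def connect_upstream(addr, client_sni_hostname):
    """Open the proxy's upstream TLS connection, forwarding the SNI name."""
    ctx = ssl.create_default_context()
    raw = socket.create_connection(addr)
    # server_hostname is what causes the SNI extension to be sent.
    return ctx.wrap_socket(raw, server_hostname=client_sni_hostname)
```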
We re-ran the measurements with these two fixes in place, and saw curious results:

It seems that the gap between the single- and multi-server errors has narrowed. We then plotted the single- and multi-server measurements separately with and without the fixes.

These two plots suggest that the fixes didn’t affect the multi-server measurements that much, but did improve the single-server measurements.

We’re not sure why the two fixes allowed the single-server version to catch up, especially considering that both our fixes only concern the RecordShell, not the ReplayShell. This might be an interesting phenomenon to look into.

## Reproducing our reproduction

Here we provide instructions for running the experiment sequentially. Since our measurements take a long time to run, we provide a shorter version with eight websites, running each measurement only five times.

### Instructions

We apologize for the complicated instructions; this is the only way we could find. The complications arise from our reliance on specific Linux kernel features (see the “Challenges” section); this is why we’re providing an image for you to use. We assume you have created a project on Google Cloud.

A better-formatted version is available on Github here.

1. Create a Google Storage bucket in your project like this. (You can choose the Multi-Regional storage class and United States as the location.)
3. Upload our image to your bucket like this. You should now see our image listed in the bucket.
4. Go to the Images page and click on Create Image. Give it a name, choose Cloud Storage File for Source, click on Browse, and locate the image you just uploaded to your bucket. Click on Create.
5. Launch a VM from the VM Instances page (click on Create Instance) with these parameters:
• Zone: us-east1-c
• Machine type: 2 vCPUs (should say 7.5 GB memory by default)
• Boot disk: click on Change, go to the Custom images tab, and choose the image you imported in Step 4; use 20 GB of Standard persistent disk
• Firewall: Allow HTTP traffic
• Click on Management, disk, networking, SSH keys. Under Metadata, enter:
• Key: serial-port-enable
• Value: 1
• Click on Create and wait for the instance to be launched.
6. When the instance has been successfully launched, click on the instance name, scroll to the bottom, and click on Connect to serial port.
7. When the console window has loaded, enter fsck /dev/sda1 -f. When asked whether to fix errors, always press y. When fsck is done, close the console window.
8. Click on Reset and wait for a minute or two.
9. Find the External IP on the same page, and ssh into it (user: cs244, password: cs244)
• You can type ssh cs244@xx.xx.xx.xx in the terminal and enter the password.
• Alternatively, go back to the VM Instances page and click on SSH next to this instance. After you’re logged in, type su cs244, enter the password, and type cd to go back to the home directory.
10. Type cd cs244-pa3-mahimahi.
11. You’re now ready to run the experiment! The short version may take up to 1.5 hours. To run the experiment in the background, type screen and press Enter, then type ./reproduce.sh. You should see something like:
DELAY = 100, TRACE = 5Mbps_trace, RUNS = 5

Sun Jun  4 21:54:11 UTC 2017
Measuring http://www.ask.com
Mahimahi raw...
12. You can now press Ctrl-A then Ctrl-D to detach from screen. Feel free to terminate the SSH connection and come back later.
13. When you come back, resume screen by entering screen -r. If the experiment has finished, you should see a URL for the graph. Go to that URL in your browser. You should hopefully see a similar but coarser-grained graph, something like this (if it looks very different from our main graph above, it might be because the number of trials/websites was too small):
### Further information

Our Github repository is here. However, we don’t think the code will work on any arbitrary machine because it relies on specific features in the Linux kernel.

If you’d like to run the entire experiment, open up reproduce.sh, replace “some-websites.txt” with “all-websites.txt”, change “runs” to 25, and run the script.