CS244 ’16: Elephants vs. Lightning – A Comparison of Hadoop and Spark on Iterative Machine Learning

Vivek Jain and Jacqueline Speiser

Introduction and Motivation

Data is growing at exponential rate as the reach of technology expands and the breadth and depth of collection increases. One method used to process and analyze this expanding data is to distribute the computation across a cluster of many machines working in parallel. Cluster computing frameworks like MapReduce [1] provide a convenient interface that allows users to write parallel computations with high-level operators, without the need to explicitly deal with work distribution and fault tolerance. However, MapReduce and similar frameworks lack abstractions for leveraging distributed memory. Iterative machine learning applications such as PageRank, k-means clustering, and logistic regression, as well as interactive data mining, benefit greatly from keeping intermediate data in memory because it is reused across multiple computations. Existing approaches before this paper [2] were either slow or specialized to a particular task.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing presents a new abstraction for in-memory computations in a fault-tolerant yet performant manner. Rather than storing all intermediate data on disk, only the lineage graph of the computation is logged to disk, allowing reconstruction via parallel computation of the necessary data in the event of a failure. The authors show that RDDs enable users to express a wide variety of computations in a single programming model, and implement this abstraction in a system called Spark.

Original Results

The paper evaluates Spark against Hadoop using several different algorithms, including logistic regression, k-means, and PageRank, and find that Spark is up to twenty times faster. The authors also present several different applications that they were able to program using Spark, including traffic modeling and spam classification.

One of the primary results in the paper compares Spark, Hadoop, and HadoopBinMem on logistic regression and k-means, on clusters of 25-100 machines and random datasets of 100GB for 10 iterations. HadoopBinMem is an optimized version of Hadoop that uses a binary format and writes intermediate data to an in-memory file system. These results are shown in figures 7 and 8. After the first iteration of logistic regression, Spark is 25.3 and 20.7 times faster than Hadoop and HadoopBinMem, respectively, and after the first iteration of k-means Spark is 1.9 and 3.2 times faster. This speedup is not as pronounced on the first iteration because all three systems read from the file system on their first iteration.

Screen Shot 2016-05-28 at 9.21.28 PM

Screen Shot 2016-05-28 at 8.54.44 PM

Our Experiment

For our project, we chose to reproduce the logistic regression results of Figures 7 and 8 from the original paper. These results stood out as some of the most important in the paper, since they very directly measure Spark’s performance on a standard machine learning algorithm. Rather than running the old versions of Spark and Hadoop from the paper, however, we decided to run the latest versions. Both have changed in the past four years, and we were curious to see whether the paper’s results would hold up with the modern versions.

Our Results

The following graphs show the iteration times of Spark and Hadoop running logistic regression on 10GB of data with 25- and 100-node clusters.

newplot (2)

Figure 1. Iteration Times for 100-Node Cluster


Figure 2. Iteration Times for 25- and 100-Node Clusters

newplot (1)

Figure 3. Spark Iteration Times for 25- and 100-Node Clusters


The fact that our Hadoop iterations are so slow is most likely a product of the implementation. We implemented a naive version of logistic regression, which grouped all the data under the same key in the reduce phase. We thought that using a Combiner would be able to handle this gracefully, but that does not appear to be the case.


We were originally going to run our experiments on both logistic regression and k-means, but had to scale back our plans when we ran into issues with k-means. Running the algorithms on Spark was relatively straightforward (Spark actually comes with examples of several machine learning algorithms), but Hadoop was another story. We quickly realized that Hadoop is not meant for iterative machine learning algorithms, and spent a while figuring out how to make them work properly. Although it worked, our implementation of k-means on Hadoop was too slow to be practical on our large test dataset. In addition, we had trouble recompiling Spark’s built-in version of k-means after we made some changes for performance measurement purposes, so we scrapped the k-means part of the experiment. Our Hadoop k-means code can still be found in our git repo.

Another challenge was generating a large amount of random data on HDFS. We were originally going to reproduce the 100GB data set from the paper, but we found it surprisingly difficult to generate random data on HDFS without storing it all on disk or memory on a single node.

A final challenge was cluster management. The EC2 API is not always user-friendly, making it that much more difficult to produce a script that can be run reliably to reproduce results.


Our results do seem to corroborate the paper’s results. Even if the numbers are not strictly accurate, it is informative that it is so much easier to implement and run iterative machine learning algorithms on Spark compared to Hadoop. Furthermore, keeping intermediate data in memory is a clear advantage since the later iterations of Spark run much quicker than the first.

The paper found an almost linear decrease in iteration times after the first, as the number of machines scaled. However, we did not find this to be the case. We believe the primary reason is that we used a smaller data size, and Spark has a minimum overhead on each iteration that limits times to over a second.

There are a variety of parameters that both Spark and Hadoop have to configure them. Although we did not get to fully explore the range of options and their effect on the speed of these platforms, from our understanding the performance-critical parameters were roughly equivalent. Spark continues to be relevant today, and is widely used as a distributed machine learning platform.

Spark is clearly an impressive system. However, we question whether comparing it to Hadoop is really the best metric of its performance. Hadoop is not built for iterative algorithms like logistic regression and k-means, while Spark quite explicitly is. Hadoop was used for machine learning at the time that this paper was written, but today there are probably better comparisons that could be made.


The choice to run our experiments on EC2 with m4.xlarge instances was an easy one, since xlarge instances were also used in the original paper. The entire setup is easily reproducible, but requires a special request because by default Amazon only allows you to launch 20 instances at a time. The type of the instance is important for reproducibility, both for performance comparisons and because smaller instances may not have enough memory to run all of the necessary programs.

Reproduction Instructions

  1. By default Amazon only allows you to launch 20 instances at a time. To increase this limit, fill out this form, and request 100 m4.xlarge instances in US West (Oregon). For the use case, write something like “For our class project, we want to reproduce the results of the Spark paper (Resilient Distributed Datasets). This will require us to run 100 on-demand xlarge instances.” It takes at least a day for this request to be approved, so start early!
  2. Clone our git repo by running git clone https://github.com/viveksjain/repro_rdd.git.
  3. Go to this page, expand the section “Access Keys (Access Key ID and Secret Access Key),” click “Create New Access Key,” and download the key file.
  4. From the repo directory, run python run.py <AWSAccessKeyId> <AWSSecretKey> .
  5. Wait for the script to complete. The script will automatically download the timing data after it has been generated.
  6. Open the IPython notebook plot.ipynb inside the repo. You will need IPython, plotly and numpy packages installed. Running all the cells inside the notebook should regenerate the graphs using your data.


  1. Dean, J., Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI’04.
  2. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., … Stoica, I. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. USENIX Association Berkeley.

One response to “CS244 ’16: Elephants vs. Lightning – A Comparison of Hadoop and Spark on Iterative Machine Learning

  1. 5 For Reproducibility. Might be good to improve instance teardown, we had trouble overreaching our limit when the first 25 didn’t terminate. It would also be useful to have more explicit ipython instructions.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s