Vivek Jain and Jacqueline Speiser
Introduction and Motivation
One of the primary results in the paper compares Spark, Hadoop, and HadoopBinMem on logistic regression and k-means, on clusters of 25-100 machines and random datasets of 100GB for 10 iterations. HadoopBinMem is an optimized version of Hadoop that uses a binary format and writes intermediate data to an in-memory file system. These results are shown in figures 7 and 8. In iterations after the first, Spark is 25.3 and 20.7 times faster than Hadoop and HadoopBinMem, respectively, on logistic regression, and 1.9 and 3.2 times faster on k-means. The speedup is less pronounced on the first iteration because all three systems must read the input from the file system at that point.
For our project, we chose to reproduce the logistic regression results of Figures 7 and 8 from the original paper. These results stood out as some of the most important in the paper, since they very directly measure Spark’s performance on a standard machine learning algorithm. Rather than running the old versions of Spark and Hadoop from the paper, however, we decided to run the latest versions. Both have changed in the past four years, and we were curious to see whether the paper’s results would hold up with the modern versions.
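Logistic regression trains by repeatedly scanning the same dataset, which is exactly the access pattern Spark's in-memory caching targets. A minimal single-machine sketch of the gradient-descent loop (plain numpy, with hypothetical sizes and learning rate, not the paper's actual implementation) makes the iteration structure concrete:

```python
import numpy as np

def logistic_regression(X, y, iterations=10, lr=0.5):
    """Batch gradient descent for logistic regression.
    Every iteration scans the full dataset -- on Spark that scan hits
    a cached RDD, while Hadoop re-reads the input from HDFS each time."""
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        preds = 1.0 / (1.0 + np.exp(-X.dot(w)))   # sigmoid predictions
        grad = X.T.dot(preds - y) / len(y)        # average gradient
        w -= lr * grad
    return w

# Synthetic, perfectly separable data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = (X.dot(true_w) > 0).astype(float)
w = logistic_regression(X, y, iterations=200)
```

The point of the sketch is that the dataset `X` is read-only across iterations; only the small weight vector `w` changes, so keeping `X` in memory amortizes the load cost over all 10 iterations.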
The following graphs show the iteration times of Spark and Hadoop running logistic regression on 10GB of data with 25- and 100-node clusters.
The fact that our Hadoop iterations are so slow is most likely an artifact of our implementation. We implemented a naive version of logistic regression, which groups all of the data under a single key in the reduce phase. We expected that a Combiner would be able to handle this gracefully, but that does not appear to be the case.
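The structure of that naive job can be simulated in miniature (plain Python standing in for the MapReduce framework; all names and values here are hypothetical, not our actual Java code). Because every mapper emits under the one shared key, a single reducer receives every record; a Combiner is supposed to pre-sum each mapper's output before the shuffle so the reducer sees only one value per mapper:

```python
def mapper(record):
    x, y = record
    # Partial gradient contribution, emitted under one shared key --
    # this funnels every record to a single reducer.
    yield ("grad", x * y)

def combiner(pairs):
    # Pre-sums one mapper's output before the shuffle, so the lone
    # reducer receives one value per mapper instead of one per record.
    yield ("grad", sum(v for _, v in pairs))

def reducer(values):
    return sum(values)

records = [(1.0, 1.0), (2.0, -1.0), (3.0, 1.0), (4.0, -1.0)]
# Two "mappers", each handling half of the records:
shards = [records[:2], records[2:]]
combined = [kv for shard in shards
            for kv in combiner(kv for r in shard for kv in mapper(r))]
result = reducer(v for _, v in combined)  # -2.0
```

Even with the combiner collapsing the shuffle traffic, the single-key design still serializes the final sum in one reducer, which is one plausible reason our Hadoop iterations stayed slow.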
We were originally going to run our experiments on both logistic regression and k-means, but had to scale back our plans when we ran into issues with k-means. Running the algorithms on Spark was relatively straightforward (Spark actually ships with examples of several machine learning algorithms), but Hadoop was another story. We quickly realized that Hadoop is not designed for iterative machine learning algorithms, and spent a while figuring out how to make them work properly. Although it worked, our implementation of k-means on Hadoop was too slow to be practical on our large test dataset. In addition, we had trouble recompiling Spark's built-in version of k-means after modifying it for performance measurement, so we scrapped the k-means part of the experiment. Our Hadoop k-means code can still be found in our git repo.
Another challenge was generating a large amount of random data on HDFS. We were originally going to reproduce the 100GB data set from the paper, but we found it surprisingly difficult to generate random data on HDFS without storing it all on disk or memory on a single node.
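One way to avoid funneling the whole dataset through a single node is to have each partition generate its own shard from a seed derived from its partition id. The sketch below simulates that idea locally (all names and sizes hypothetical; on a real cluster the outer loop would be a parallelized Spark job writing to HDFS):

```python
import random

def generate_partition(part_id, rows_per_part, dims=10):
    # Seeding from the partition id makes each shard independent and
    # reproducible -- no single node ever holds the whole dataset.
    rng = random.Random(part_id)
    for _ in range(rows_per_part):
        yield [rng.gauss(0, 1) for _ in range(dims)]

# On a cluster this would be roughly
#   sc.parallelize(range(parts), parts).flatMap(...).saveAsTextFile(...)
# Here we just run the partitions sequentially to show the shape.
parts = 4
data = [row for pid in range(parts)
        for row in generate_partition(pid, rows_per_part=100)]
```

Deriving the seed from the partition id (rather than calling a global RNG) is what makes the generation embarrassingly parallel: any worker can produce its shard without coordination.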
A final challenge was cluster management. The EC2 API is not always user-friendly, making it that much more difficult to produce a script that can be run reliably to reproduce results.
Our results do seem to corroborate the paper’s results. Even if our numbers do not match the paper’s exactly, it is informative that it is so much easier to implement and run iterative machine learning algorithms on Spark than on Hadoop. Furthermore, keeping intermediate data in memory is a clear advantage, since the later iterations of Spark run much more quickly than the first.
The paper found an almost linear decrease in iteration times after the first, as the number of machines scaled. However, we did not find this to be the case. We believe the primary reason is that we used a smaller data size, and Spark has a minimum overhead on each iteration that limits times to over a second.
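The effect of that fixed overhead can be illustrated with a toy cost model (all constants here are hypothetical, chosen only to show the shape of the curve, not measured values):

```python
def iteration_time(data_gb, machines, gb_per_s_per_machine=0.5, overhead_s=1.0):
    # Scan time shrinks as machines are added, but a fixed per-iteration
    # overhead (scheduling, task launch) puts a floor on the total.
    scan = data_gb / (machines * gb_per_s_per_machine)
    return overhead_s + scan

t25 = iteration_time(10, 25)    # 1.0 + 0.8 = 1.8 s
t100 = iteration_time(10, 100)  # 1.0 + 0.2 = 1.2 s
```

Quadrupling the cluster cuts the scan portion by 4x, but the total only improves by 1.5x because the overhead term dominates at small data sizes, which matches the sub-linear scaling we observed with 10GB.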
Both Spark and Hadoop expose a variety of configuration parameters. Although we did not fully explore the range of options and their effect on the speed of these platforms, from our understanding the performance-critical parameters were set roughly equivalently. Spark continues to be relevant today, and is widely used as a distributed machine learning platform.
Spark is clearly an impressive system. However, we question whether comparing it to Hadoop is really the best metric of its performance. Hadoop is not built for iterative algorithms like logistic regression and k-means, while Spark quite explicitly is. Hadoop was used for machine learning at the time that this paper was written, but today there are probably better comparisons that could be made.
The choice to run our experiments on EC2 with m4.xlarge instances was an easy one, since xlarge instances were also used in the original paper. The entire setup is easily reproducible, but requires a special request because by default Amazon only allows you to launch 20 instances at a time. The type of the instance is important for reproducibility, both for performance comparisons and because smaller instances may not have enough memory to run all of the necessary programs.
- By default Amazon only allows you to launch 20 instances at a time. To increase this limit, fill out this form, and request 100 m4.xlarge instances in US West (Oregon). For the use case, write something like “For our class project, we want to reproduce the results of the Spark paper (Resilient Distributed Datasets). This will require us to run 100 on-demand xlarge instances.” It takes at least a day for this request to be approved, so start early!
- Clone our git repo by running
git clone https://github.com/viveksjain/repro_rdd.git.
- Go to this page, expand the section “Access Keys (Access Key ID and Secret Access Key),” click “Create New Access Key,” and download the key file.
- From the repo directory, run
python run.py <AWSAccessKeyId> <AWSSecretKey>
- Wait for the script to complete. The script will automatically download the timing data after it has been generated.
- Open the IPython notebook plot.ipynb inside the repo. You will need the IPython, plotly, and numpy packages installed. Running all the cells inside the notebook should regenerate the graphs using your data.
- Dean, J., Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI’04.
- Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., … Stoica, I. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI’12.