Scaling Apache Spark is typically the last step before executing a Spark-dependent workflow. In previous articles, we introduced Spark and showed how to optimize it. Once correctly optimized, scaling Apache Spark becomes trivial.
To demonstrate, we return to the NYC taxi dataset originally described here. As of 2019, this dataset contains about 1.5 billion anonymized taxi trips. Each record contains details such as the trip duration, fare amount, and passenger count. Additional details pertinent to this analysis are the geographical coordinates of the pickup locations, plotted in red below. The cyan lines represent distinct geographical taxi zones. Interact with the map to appreciate the scale and density.
Now we have a simple goal: identify the taxi zone within which each pickup location lies. Given a data point:
[-73.953804016113281, 40.788127899169922]
and a geographical taxi zone:
{ ... [ -73.848597000000183, 40.871670000000115 ], [ -73.845822536836778, 40.870239076236174 ], [ -73.854559184633743, 40.859953835764252 ], [ -73.854665433068263, 40.859585694988056 ], [ -73.856388703358959, 40.857593635304482 ]...
how is a computer supposed to determine whether the given data point lies within the taxi zone?
This is conceptually simple, but computationally expensive. A young child can easily deduce that the red point is outside the polygon, while the blue point is within the polygon. Our challenge is getting an algorithm to make that determination both accurately and quickly. Once we find a suitable approach, we can scale it to handle larger datasets.
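To make the test concrete, the sketch below shows the classic ray-casting approach in plain Scala: cast a horizontal ray from the point and count how many polygon edges it crosses; an odd count means the point is inside. This illustrates the general technique rather than the exact algorithm any particular package uses, and the triangular "zone" in the example is made up.

```scala
// A library-free sketch of the ray-casting point-in-polygon test: an odd
// number of edge crossings by a horizontal ray means the point is inside.
object PointInPolygon {
  def contains(polygon: Array[(Double, Double)], x: Double, y: Double): Boolean = {
    var inside = false
    var j = polygon.length - 1
    for (i <- polygon.indices) {
      val (xi, yi) = polygon(i)
      val (xj, yj) = polygon(j)
      // Edge (j, i) straddles the ray's y-coordinate and the crossing lies to the right of x.
      if (((yi > y) != (yj > y)) &&
          (x < (xj - xi) * (y - yi) / (yj - yi) + xi)) {
        inside = !inside
      }
      j = i
    }
    inside
  }

  def main(args: Array[String]): Unit = {
    // A made-up triangular "zone" in (longitude, latitude) order.
    val zone = Array((-74.0, 40.7), (-73.9, 40.9), (-73.8, 40.7))
    println(contains(zone, -73.953804, 40.788128)) // true: the sample pickup falls inside
  }
}
```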
Initially, we compared how fast several open-source software packages processed a subset of the data. We found that the combination of Magellan and Spark, running on a laptop, was able to process 12 million records in 1.1 minutes. Processing a larger dataset on the same laptop might take a prohibitively long time. Thus, we need to scale up our analysis platform, and Apache Spark is perfectly suited for this. An important intermediate step, however, when moving from a small test setup to a full-scale analysis, is to properly optimize Apache Spark parameters. We covered that here.
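For a sense of what such a pipeline looks like in code, a Magellan-based point-in-polygon join roughly follows the pattern below, modeled on the examples in Magellan's documentation. The `spark` session, the `trips` DataFrame, the column names, and the S3 path are assumptions for illustration, and the DSL may differ between Magellan versions.

```scala
// Rough sketch of a Magellan point-in-polygon join; names and paths are assumptions.
import org.apache.spark.sql.magellan.dsl.expressions._  // point(), within, ...
import spark.implicits._                                 // enables the $"..." column syntax

// Taxi zone polygons loaded from a shapefile directory (hypothetical path).
val zones = spark.read.format("magellan").load("s3://my-bucket/taxi_zones/")

// Build a Magellan point from each trip's pickup coordinates.
val pickups = trips.select(point($"pickup_longitude", $"pickup_latitude").as("pickup"))

// Keep each (trip, zone) pair where the pickup point falls inside the zone polygon.
val matched = pickups.join(zones).where($"pickup" within $"polygon")
```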
Given that we were able to process 12 million records in 1.1 minutes on a laptop, linear extrapolation suggests that 1.2 billion points (100 times the data) would take roughly two hours on the same laptop. But what if we need the results sooner? To start with, we scale up to a cloud-based server. We choose an AWS a1.4xlarge instance, with 16 cores and 32 GB of memory. Next, we scale up our dataset from 12 million points to 1.2 billion points. We display the results below.
| Platform | Cores | Memory | Data points | Duration (minutes) |
|---|---|---|---|---|
| Laptop | 2 x 2.5 GHz | 8 GB | 12 million | 1.1 |
| Cloud instance | 16 x 2.3 GHz | 32 GB | 1.2 billion | 18 |
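For the cloud instance run, a single-node session might be configured along the following lines; the specific values are illustrative assumptions, and the earlier optimization post is the place to derive them properly.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative single-node settings for a 16-core / 32 GB instance; the values
// are assumptions, not the tuned parameters from the earlier post.
val spark = SparkSession.builder()
  .appName("taxi-zone-pickup-join")
  .master("local[16]")                           // one worker thread per core
  .config("spark.sql.shuffle.partitions", "64")  // a few shuffle partitions per core
  .getOrCreate()
// Driver memory (e.g. --driver-memory 24g) must be set when the JVM is launched,
// via spark-submit or spark-shell, rather than in the builder.
```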
What if 18 minutes is still too long to wait for the results? What if we need them in 5 minutes? One option would be to move from the single node to a cluster, say of 4 identical nodes. Perfectly linear scaling would cut the 18 minutes to about 4.5 minutes; even though Spark jobs don't scale entirely linearly, we should still be able to process the 1.2 billion points in about 5 minutes on such a cluster.
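To sketch what that might look like, the same job could be submitted to a cluster manager with executor sizing along the lines below; the instance counts and memory figures are illustrative assumptions, not measured settings.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative executor sizing for a 4-node cluster (4 x 16 cores, 32 GB each);
// the figures are assumptions. The cluster manager (e.g. --master yarn) is
// chosen at submit time.
val spark = SparkSession.builder()
  .appName("taxi-zone-pickup-join")
  .config("spark.executor.instances", "8")  // e.g. two executors per node
  .config("spark.executor.cores", "8")      // 8 executors x 8 cores = 64 cores in total
  .config("spark.executor.memory", "12g")   // leaves per-node headroom for OS and overhead
  .getOrCreate()
```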
Summary
Apache Spark is one of the tools analysts use to take the “big” out of “big data.” In a series of posts, we showed how to move from a simple test case, to optimizing parameters, to scaling Apache Spark. We ran a computationally expensive algorithm, processing 1.2 billion points in 18 minutes, and suggested ways of reducing the processing time even further.