Big data analytics with Apache Spark and Hadoop


Increasingly, data analysts turn to Apache Spark and Hadoop to take the “big” out of “big data.” Typically, this entails partitioning a large dataset into multiple smaller datasets to allow parallel processing.

In a previous post, we explained how distribution enables analysis of datasets that are too large to fit in memory on a single machine. In this post, we compare different solutions for performing distributed geospatial analysis. Let’s define the test case first.

We continue working with the New York City taxi data from the previous article. For this test, we once again use the 12.1 million trips from March 2016, clocking in at 2 GB. Recall that one of the attributes recorded for each trip is the pickup latitude and longitude. A red dot on the map below represents one such pickup location. Zoom in to appreciate the scale and density.

 

 

New York City is subdivided into distinct geographic taxi zones; we plot the outlines of the 263 zones using cyan lines. The computer, however, doesn’t “see” pretty maps like we do. To the computer, a pickup latitude/longitude pair is just a pair of numbers:

[-73.953804016113281, 40.788127899169922]

Similarly, a collection of latitude/longitude pairs defines the boundaries of the taxi zones:

{ ... [ -73.848597000000183, 40.871670000000115 ], [ -73.845822536836778, 40.870239076236174 ], [ -73.854559184633743, 40.859953835764252 ], [ -73.854665433068263, 40.859585694988056 ], [ -73.856388703358959, 40.857593635304482 ]...

Now, we want to perform a conceptually simple analysis. Our goal is to determine within which of the 263 zones each of the 12 million pickup locations lies. Ignoring the scale of the data, this is a relatively straightforward problem from a human perspective. In the image below, even a child can deduce that the red point is outside the polygon, while the blue point is within the polygon. The challenge is getting a computer to determine this accurately, at scale, and as quickly (and cheaply) as possible. How can the computer, presented with a pickup coordinate (a pair of numbers) and a geographic zone (a collection of coordinates), determine whether the point falls within the zone?


Determining whether a point is in a polygon is conceptually simple but computationally expensive. (Image credit.)

 

One approach, the “point-in-polygon” algorithm, is conceptually simple yet computationally expensive. The algorithm is an example of a spatial join. In general, a spatial join combines the attributes of one feature with those of another feature based on a spatial relationship between them. Specifically, our spatial join compares the coordinates of the pickup locations with the coordinates of the taxi zone boundaries and groups the pickup locations by the taxi zone they fall within. We perform the spatial join using several open-source software packages and compare the results below.
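
To make the test concrete, here is a minimal point-in-polygon check using Shapely (one of the packages compared below). The pickup coordinate is the pair shown above; the zone polygon is a made-up rectangle for illustration, not a real taxi zone boundary.

# Minimal point-in-polygon check with Shapely.
from shapely.geometry import Point, Polygon

# One pickup location: (longitude, latitude), matching the pair shown above.
pickup = Point(-73.953804016113281, 40.788127899169922)

# A toy "taxi zone": a rectangle defined by (longitude, latitude) vertices.
zone = Polygon([
    (-73.96, 40.78),
    (-73.94, 40.78),
    (-73.94, 40.80),
    (-73.96, 40.80),
])

# within() performs the point-in-polygon test; the spatial join simply repeats
# this test for every pickup against every candidate zone.
print(pickup.within(zone))  # True for this toy example

The spatial join effectively runs this test 12 million times against up to 263 polygons each, which is why the choice of software and execution model matters so much.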

 

SOFTWARE PACKAGES

Each of the following software packages (or combinations of packages) implements a spatial join. We ran each combination three times and averaged the time it took to perform the join; a short code sketch of the non-distributed approach follows the list.

  1. Geopandas and Shapely running in a single, non-distributed process.
  2. Geopandas and Shapely, parallelized with Dask, a Python library for distributed computing.
  3. Apache Spark and GeoSpark.
  4. Apache Spark and Magellan.
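
As a rough sketch of the non-distributed approach (option 1), the snippet below builds a GeoDataFrame of pickup points and joins it against the zone polygons with Geopandas. The file and column names are placeholders for this post’s data, and the keyword that selects the “within” test is op in older Geopandas releases and predicate in newer ones.

# Sketch of the non-distributed spatial join: Geopandas + Shapely.
import pandas as pd
import geopandas as gpd

# Taxi zone polygons; "taxi_zones.geojson" is a placeholder file name.
zones = gpd.read_file("taxi_zones.geojson")

# Pickup locations: build point geometries from the longitude/latitude columns.
trips = pd.read_csv("pickups.csv")
pickups = gpd.GeoDataFrame(
    trips,
    geometry=gpd.points_from_xy(trips["pickup_longitude"], trips["pickup_latitude"]),
    crs=zones.crs,
)

# The spatial join: attach to each pickup the zone it falls within
# (use op="within" instead of predicate="within" on older Geopandas versions).
joined = gpd.sjoin(pickups, zones, how="inner", predicate="within")

# Pickups per zone; "zone" is assumed to be the name column in the zones file.
print(joined.groupby("zone").size())

For the Dask variant (option 2), the same join can be applied per partition of the trip data, so each worker processes a slice of the pickups against the full set of zones.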

 

RESULTS

Software                                        Time (minutes)
Geopandas + Shapely (standalone)                Crash
Geopandas + Shapely, distributed using Dask     35
GeoSpark (RDD) + Spark                          6.5
GeoSpark (DataFrame) + Spark                    3.4
Magellan + Spark                                1.1

 

Evidently, Magellan running on Apache Spark is best suited to perform the spatial join, taking slightly over a minute to process 12 million points. At that rate, Magellan would take almost 2.5 hours to process the 1.5 billion trips in the entire NYC taxi dataset. The other methods are progressively slower, and the only non-distributed configuration did not complete the analysis at all. Granted, we performed this test on a personal computer, and dedicated hardware would improve performance considerably. But that was precisely the point of the exercise: to demonstrate how an analyst without access to expensive servers can take the “big” out of “big data.”
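
Magellan’s API is Scala-based and not reproduced here. Purely as an illustration of how a distributed join of this kind runs on Spark, the sketch below broadcasts the zone polygons to every executor and applies the Shapely point-in-polygon test inside a PySpark UDF; this is not Magellan’s method, and the file names, column names, and zone property are placeholders.

# Illustrative distributed spatial join: PySpark + Shapely (not Magellan's API).
# Requires Shapely to be installed on every Spark worker.
import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from shapely.geometry import Point, shape

spark = SparkSession.builder.appName("taxi-zones").getOrCreate()

# Load the zone polygons once on the driver and broadcast them to all workers.
with open("taxi_zones.geojson") as f:
    features = json.load(f)["features"]
zones = [(feat["properties"]["zone"], shape(feat["geometry"])) for feat in features]
zones_bc = spark.sparkContext.broadcast(zones)

@udf(returnType=StringType())
def zone_of(lon, lat):
    # Linear scan over the 263 zones; a spatial index would be faster.
    point = Point(lon, lat)
    for name, polygon in zones_bc.value:
        if point.within(polygon):
            return name
    return None

trips = spark.read.csv("pickups.csv", header=True, inferSchema=True)
result = trips.withColumn("zone", zone_of("pickup_longitude", "pickup_latitude"))
result.groupBy("zone").count().show()

Dedicated libraries such as Magellan and GeoSpark improve on this naive pattern with spatial indexing and partitioning, which is reflected in the timings above.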

As discussed earlier, the result of the spatial join tells us which taxi zone each pickup location falls in. In the map below, we use data-driven styling to plot the different zones in different colors. Interact with the map by zooming and dragging.

 

OTHER USES OF SPATIAL JOINS

Spatial joins are widespread in geospatial data mining and analysis; they can answer questions such as:

  • In what part of the city do most of my customers live?
  • Which products sell the best in different locations?
  • Where is the {hospital, cell tower, mall, bank} closest to my {home, office}?
  • Which restaurants within 10 km of my house have an average rating of 4/5 or better?
  • What is the best path from point A to point B that stays as far away from feature C as possible?

In conclusion, we introduced a frequently used geospatial concept, the spatial join, and timed how long different software packages took to execute it. The results demonstrate that it is possible for an analyst without access to expensive servers to strip the “big” from “big data.” Magellan in combination with Apache Spark provided the fastest solution. In our next post, we show how to optimize Apache Spark.



TECHNICAL DETAILS:

SOFTWARE: Scala-based: GeoSpark, Magellan, sbt. Python-based: Geopandas, Shapely, Dask, pandas, geojson. Others: mapbox, tippecanoe.

OTHER DETAILS: The tests ran in a virtual machine with Ubuntu 16.04, 8 GB of memory, and two 2.5 GHz Intel i5 virtual processors.
