Taking the “big” out of “big data”


The term “big data” is tossed around with increasing frequency. Pundits present big data as a pressing analytics problem whose solution is critical to profitability. But what is big data?

From an analytics perspective, the term “big data” is relative. One analyst’s big data is another analyst’s data. Let’s illustrate this using a well-known big dataset. The city of New York has allowed open access to an anonymized dataset of taxi trips from 2009 to present. Each record contains details like the pickup and dropoff locations, pickup and dropoff times, trip duration, fare amount, tip amount, and passenger count. The dataset is fascinating, offering insights into the lives of a subset of New Yorkers. As of 2018, there are 1.5 billion records.

Each month contains 15 million records, give or take a few million. Meditate upon this for a moment: on average, New Yorkers take almost half a million taxi rides a day. This is an insane number of trips. In the figure above, about 10 million taxi pickup locations from March 2016 are displayed on a map/satellite hybrid of New York City. Zoom in to observe street-level data. Note that pickup locations might show up in buildings and other structures since GPS is only accurate to within several meters.

An average month weighs in at around 2 GB; an average year, around 25 GB; the entire dataset, around 300 GB. We can now qualify our earlier claim that big data is relative. Ingesting the entire NYC dataset in one sitting requires a machine with more than 300 GB of memory, and that is before you perform any analysis. An analyst with access to such servers might not consider the NYC dataset “big data”: they could load and process the entire dataset in memory. But where does that leave the rest of us, who don’t have access to such high-powered computing? Is there a way we can take the “big” out of “big data?”

Fortunately, the answer is yes. In subsequent posts (here, here), we illustrate ways to analyze datasets that are too large to fit in memory on a single machine. Such “out-of-core” (OOC) analysis is not new, but it is increasingly relevant in the era of big data. Most of these solutions partition the dataset into smaller chunks and process the individual chunks in a distributed manner. For example, one could partition the NYC dataset into 20 smaller datasets and process them simultaneously on 20 individual machines. Alternatively, one could partition the dataset into 20 smaller datasets and process each one sequentially on a single machine.
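The sequential variant can be sketched in a few lines of pandas: read the file in chunks so only one chunk occupies memory at a time, and accumulate results across chunks. The tiny in-memory CSV and its column names below are stand-ins, not the actual NYC taxi schema.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for a taxi-trip CSV; the real
# NYC schema uses different (and more numerous) columns.
csv_data = io.StringIO(
    "fare_amount,tip_amount,passenger_count\n"
    "10.0,2.0,1\n"
    "7.5,1.5,2\n"
    "20.0,4.0,1\n"
    "5.0,0.0,3\n"
)

# Process the file in chunks so only one chunk is in memory at a time.
total_fares = 0.0
total_trips = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total_fares += chunk["fare_amount"].sum()
    total_trips += len(chunk)

print(total_trips, total_fares)  # 4 42.5
```

The same loop works unchanged on a multi-gigabyte file on disk; only `chunksize` (rows per chunk) needs tuning to the machine's memory.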

Stay tuned for posts detailing these methods. For now, let’s glean some insights from just one month (March 2016) of the NYC taxi dataset.

Big data processing

In terms of potential customers, what is the best time for a taxi driver to be out and about? The worst?

In the figure above, we bin the monthly data by hour of day. As expected, ridership drops off in the wee hours. Also expected is the surge in ridership during the morning and evening rush hours. Clearly, the busiest times occur in early evening.
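Binning by hour of day reduces to a one-line pandas groupby on the pickup timestamp. The toy timestamps and the `pickup_datetime` column name below are assumptions for illustration.

```python
import pandas as pd

# Toy trips with pickup timestamps; the real column name differs.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-03-01 08:15", "2016-03-01 08:45",
        "2016-03-01 18:05", "2016-03-01 02:30",
    ]),
})

# Count rides per hour of day (0-23).
rides_per_hour = trips.groupby(trips["pickup_datetime"].dt.hour).size()
print(rides_per_hour)
```

On the full month, the resulting 24-element series is exactly what the figure plots.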

The heatmap above displays tips as a percentage of taxi fare, generated from roughly 8 million points. In what locations do riders give large tips as a percentage of the fare? Zoom in for more granular data.
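The quantity behind the heatmap is a simple derived column: tip divided by fare, scaled to a percentage. A minimal sketch, with column names assumed rather than taken from the actual schema:

```python
import pandas as pd

# Toy fares and tips; real NYC column names may differ.
trips = pd.DataFrame({
    "fare_amount": [10.0, 20.0, 8.0],
    "tip_amount": [2.0, 3.0, 0.0],
})

# Tip as a percentage of the fare -- the value mapped in the heatmap.
trips["tip_pct"] = 100.0 * trips["tip_amount"] / trips["fare_amount"]
print(trips["tip_pct"].tolist())  # [20.0, 15.0, 0.0]
```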

In conclusion, the term “big data” is relative to an analyst’s tools. There are ways to strip the “big” from “big data,” subject to some limitations. We will detail these tools in upcoming posts.



SOFTWARE: Linux, Python, Dask, pandas, GeoPandas, GeoJSON, Shapely, Mapbox, Tippecanoe
OTHER DETAILS: The March 2016 dataset (2 GB) fits in memory, but leaves room for only very limited processing on an 8 GB system; Dask lets us process it in parallel, chunk by chunk. We generate the maps in Mapbox and style them across the data and zoom ranges. Tippecanoe builds the map tilesets that provide scale-independent views.