Scaling Apache Spark is typically the last step before executing a Spark-dependent workflow. In previous articles, we introduced Spark and showed how to optimize it. Once a workload is correctly optimized, scaling it becomes trivial. To demonstrate, we return to the NYC taxi dataset originally described here. As of 2019, this dataset contains about 1.5 billion anonymized [...]
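As a rough illustration of what "trivial" scaling can look like, the sketch below requests more executors through standard Spark configuration properties (effective on a cluster manager such as YARN or Kubernetes). It is a minimal, hypothetical example; the app name and resource values are placeholders, not settings from the original post:

```python
from pyspark.sql import SparkSession

# Minimal sketch: once a job is optimized, scaling out is largely a matter
# of requesting more executors. All values here are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName("scaling-sketch")                  # hypothetical app name
    .config("spark.executor.instances", "16")   # scale out: more executors
    .config("spark.executor.cores", "4")        # cores per executor
    .config("spark.executor.memory", "8g")      # memory per executor
    .getOrCreate()
)
```

The same application code then runs unchanged; only the resource request grows with the data.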
Apache Spark and Hadoop in big data analytics
Increasingly, data analysts turn to Apache Spark and Hadoop to take the "big" out of "big data." Typically, this entails partitioning a large dataset into many smaller pieces that can be processed in parallel. In this previous post, we explained how distribution enables analysis of datasets that are too large to fit in memory on a single [...]
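To make the partitioning idea concrete, here is a minimal PySpark sketch. The file path, partition count, and column name are hypothetical placeholders, not code from the linked post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

# Reading a large file: Spark splits it into partitions automatically,
# one task per partition, processed in parallel across executors.
# "data/taxi_trips.csv" is a placeholder path.
trips = spark.read.csv("data/taxi_trips.csv", header=True, inferSchema=True)
print(trips.rdd.getNumPartitions())  # how many pieces Spark chose

# Repartitioning redistributes rows so more (or fewer) tasks run in parallel.
trips = trips.repartition(200)

# Each partition is aggregated independently; partial results are combined.
trips.groupBy("passenger_count").count().show()

spark.stop()
```

Each partition becomes an independent task, so adding executors lets more partitions be processed at once; that is the sense in which partitioning takes the "big" out of "big data."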