Increasingly, data analysts turn to Apache Spark and Hadoop to take the "big" out of "big data." Typically, this entails partitioning a large dataset into multiple smaller datasets so they can be processed in parallel. In a previous post, we explained how distribution enables analysis of datasets that are too large to fit in memory on a single [...]
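
To make the partitioning idea concrete, here is a minimal PySpark sketch. It is not taken from the earlier post: the file path, partition count, and column name are illustrative assumptions, but it shows how Spark splits a large dataset into partitions that executors process in parallel.

```python
# Minimal sketch: partition a large dataset so Spark can process the
# pieces in parallel. Path, partition count, and column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-example").getOrCreate()

# Read a (hypothetical) large CSV; Spark splits the input into partitions.
df = spark.read.csv("hdfs:///data/large_dataset.csv",
                    header=True, inferSchema=True)

# Explicitly repartition so the work is spread across more executors.
df = df.repartition(64)

# The aggregation runs on each partition in parallel, then results are combined.
summary = df.groupBy("some_column").count()
summary.show()
```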