Twitter analytics: 1.2 million Kenyan tweets

March 27, 2019 | April 30, 2019 | 1 Comment

Twitter analytics is an integral part of leveraging social media today. Among certain demographics in Kenya, Twitter is by far the social media platform with the most interactions. Therefore, any company engaged in selling must realize that social media is an additional sales channel. Consequently, a well-devised marketing strategy will lead to a larger market footprint, higher [...]

Scaling Apache Spark: 1.2 billion data points in 18 minutes

March 18, 2019

Scaling Apache Spark is typically the last step before executing a Spark-dependent workflow. In previous articles, we introduced Spark, and showed how to optimize it. Once correctly optimized, scaling Apache Spark becomes trivial. To demonstrate, we return to the NYC taxi dataset originally described here. As of 2019, this dataset contains about 1.5 billion anonymized [...]

Optimize Apache Spark and Hadoop in big data analytics [Part 2] [Advanced]

March 16, 2019

Forum questions often ask why, for a particular Spark job, certain configurations outperform others. A naive understanding of Spark might suggest that increasing the number of executors, or the number of cores per executor, will lead to faster job completion. This is wrong. In this post, we show how to optimize Apache Spark. Faster execution [...]

Apache Spark and Hadoop in big data analytics

November 6, 2018 | March 18, 2019

Increasingly, data analysts turn to Apache Spark and Hadoop to take the "big" out of "big data." Typically, this entails partitioning a large dataset into multiple smaller datasets to allow parallel processing. In this previous post, we explained how distribution enables analysis of datasets that are too large to fit in memory on a single [...]

Machine Learning

and other tech stuff

Category: Apache Spark
