Scaling Apache Spark: 1.2 billion data points in 18 minutes

March 18, 2019

Scaling Apache Spark is typically the last step before executing a Spark-dependent workflow. In previous articles, we introduced Spark and showed how to optimize it. Once Spark is correctly optimized, scaling it becomes trivial. To demonstrate, we return to the NYC taxi dataset originally described here. As of 2019, this dataset contains about 1.5 billion anonymized [...]

Apache Spark and Hadoop in big data analytics

November 6, 2018 (updated March 18, 2019)

Increasingly, data analysts turn to Apache Spark and Hadoop to take the "big" out of "big data." Typically, this entails partitioning a large dataset into multiple smaller datasets to allow parallel processing. In a previous post, we explained how distribution enables analysis of datasets that are too large to fit in memory on a single [...]

Get down the mountain, quickly!

June 16, 2017 (updated July 6, 2017)

You are standing on the side of a steep mountain. You need to descend to the base of the mountain as quickly as possible. Remarkably, this scenario illustrates a central concept in machine learning. But let's get back to the mountain. I'd imagine that the first thing you would do, almost intuitively, would be to [...]

Teach your child to count – the machine learning way.

April 24, 2017 (updated May 15, 2017)

What is machine learning? I could bore you with textbook definitions. Instead, let me use a familiar example. A few days ago, I was teaching a child how to count. This is what transpired: Child: 1, 2, 3, 4, 5, 3, 8... Me: Stop. 1, 2, 3, 4, 5 is correct, but what comes after [...]

Is Artificial Intelligence an existential threat to humanity?

March 24, 2017 (updated March 31, 2017)

The masterfully scripted Ex-Machina is a slow-burning, cerebral thriller that subtly exposes moral and ethical questions surrounding Artificial Intelligence (AI). Indeed, I have yet to watch a better movie on the subject. Ex-Machina (literally, "from the machine") revolves around three characters. First, we meet Caleb, an exceptional computer programmer. He works at Bluebook, a company [...]

Machine Learning

and other tech stuff

Tag: Non-technical
