You are standing on the side of a steep mountain. You need to descend to the base of the mountain as quickly as possible. Remarkably, this scenario illustrates a central concept in machine learning.
But let’s get back to the mountain. I’d imagine that the first thing you would do, almost intuitively, would be to orient yourself and start walking down the mountain, not up or across it. Logically, to reach the base as quickly as possible, you would walk in the direction of steepest descent.
THE COST FUNCTION
In our last post, we introduced the concept of a cost function. Recall, in machine learning, we feed a machine a large amount of data, with the goal that the machine will learn the relationships that govern the data. Having learned the relationships, the machine can make predictions about future data. The cost function is a tool we use to evaluate how well the machine has learned.
Let’s get graphical. The figure below depicts an arbitrary cost function. Think of the surface as measuring the error, or difference, between some “correct” value and what the machine is outputting. Large values of the cost function (red, orange) indicate that the machine is outputting values with a large error. Conversely, small values of the cost function (blue) indicate low errors: the machine is learning better. The absolute minimum of the cost function (point B, where the error is smallest) represents the machine outputting the “correct” values. Ideally, we want the machine to output values close to B. In machine learning parlance, we need to “minimize the cost function.”
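To make “error between the correct value and the machine’s output” concrete, here is a minimal sketch of one common cost function, the mean squared error. The choice of mean squared error, the function name, and the sample numbers are all assumptions made for illustration; the figure’s cost function is arbitrary and need not be this one.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and correct values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    total = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return total / len(targets)


# Hypothetical data: the "correct" values and two sets of machine outputs.
targets = [1.0, 2.0, 3.0]
poor_guess = [3.0, 0.0, 6.0]   # far from the targets: high on the cost surface
good_guess = [1.1, 1.9, 3.2]   # close to the targets: near the minimum

print("cost of poor guess:", mean_squared_error(poor_guess, targets))
print("cost of good guess:", mean_squared_error(good_guess, targets))
```

The poor guess produces a much larger cost than the good one, and a perfect guess would produce a cost of exactly zero, which is the role point B plays on the surface.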
THE FASTEST WAY DOWN
Now, just as you were stuck on the side of a steep mountain, imagine that the machine is outputting values that place it at point A on the cost function. As we have seen, point A (large error) represents values that are far from optimal. Both you and the machine need to move to the base as fast as possible.
If you remember your elementary calculus... wait, what did you say? You don’t remember your elementary calculus? OK, never mind, just pay close attention. In calculus, there is a magical quantity called the gradient. Calculate the gradient of a function at a point, and it tells you the direction in which the function is increasing fastest from that point. The direction opposite the gradient is therefore the direction of steepest descent. Please re-read the last two sentences before I attempt to tie everything together.
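If the calculus is rusty, a few lines of code may help. The sketch below estimates the gradient numerically by nudging each input a tiny amount and watching how the output changes. The bowl-shaped `cost` function, the point (1, 2), and the step size are hypothetical choices for illustration only.

```python
def cost(x, y):
    # Hypothetical bowl-shaped cost with its minimum at (0, 0).
    return x ** 2 + 3 * y ** 2


def numerical_gradient(f, x, y, h=1e-6):
    """Estimate the gradient of f at (x, y) using central differences."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return df_dx, df_dy


x0, y0 = 1.0, 2.0
gx, gy = numerical_gradient(cost, x0, y0)
print("gradient at (1, 2):", gx, gy)  # calculus gives exactly (2x, 6y) = (2, 12)

# Take one small step along the gradient and one small step against it.
step = 0.01
uphill = cost(x0 + step * gx, y0 + step * gy)
downhill = cost(x0 - step * gx, y0 - step * gy)
print("cost here:", cost(x0, y0))
print("cost after stepping along the gradient:", uphill)
print("cost after stepping against the gradient:", downhill)
```

Stepping along the gradient raises the cost, and stepping against it lowers the cost, which is exactly the "which way is down" information the machine needs.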
You find yourself on the side of a mountain. Likewise, the machine finds itself on the side of a cost function (point A). You both want to get to the base as fast as possible. You orient yourself, find which way is down, and start walking, making periodic corrections. The machine does something similar: it calculates the gradient, which points in the direction of steepest ascent. It then takes a step in the direction opposite the gradient, thereby descending at the fastest rate. After the initial step, it repeats the process (calculate the gradient, take a step opposite the gradient) until it ends up at the base, B. At this point, there is minimal error between its predictions and the correct values. Voilà, the machine has learned.
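The loop just described (calculate the gradient, step the opposite way, repeat) can be sketched in a few lines. This is a minimal illustration, not a production implementation: the bowl-shaped cost, the starting point standing in for A, the fixed step size (`learning_rate`), and the stopping rule are all assumptions chosen to keep the example short.

```python
def cost(x, y):
    # Hypothetical bowl-shaped cost; its minimum (our "point B") is at (0, 0).
    return x ** 2 + 3 * y ** 2


def gradient(x, y):
    # Gradient of the cost above, worked out by hand: (2x, 6y).
    return 2 * x, 6 * y


def gradient_descent(start, learning_rate=0.1, tolerance=1e-8, max_steps=1000):
    """Repeatedly step opposite the gradient until the slope is nearly flat."""
    x, y = start
    path = [(x, y)]
    for _ in range(max_steps):
        gx, gy = gradient(x, y)
        # Stop once the gradient is essentially zero: we are at the base.
        if gx * gx + gy * gy < tolerance ** 2:
            break
        x -= learning_rate * gx
        y -= learning_rate * gy
        path.append((x, y))
    return (x, y), path


# Start at a hypothetical "point A" high on the surface.
(final_x, final_y), path = gradient_descent(start=(4.0, -3.0))
print("steps taken:", len(path) - 1)
print("final position:", final_x, final_y)
print("final cost:", cost(final_x, final_y))
```

The size of each step matters. On this particular surface a step size of 0.1 walks steadily downhill, while a much larger one would overshoot the base and bounce around or even climb.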
Gradient descent, as the process is called, is a fundamental and widespread technique used in machine learning. The following animations show several optimization methods, some of them variants of gradient descent, and they all share a common theme: given a cost function, keep moving downhill until the minimum is reached.
Of the optimization methods above, which is the “best”? They all follow different routes toward the minimum of the Rosenbrock function, so it is tempting to single one out, for example the Powell method, as the “best.” In practice, many factors, such as computational expense (read: time and money), accuracy, and robustness, affect this and many other decisions in machine learning. This idea of trade-offs is one we will return to often.
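To get a feel for the computational-expense side of that trade-off, here is a rough sketch that runs plain, fixed-step gradient descent on the Rosenbrock function and reports the cost after different numbers of steps. It does not reproduce the animations or the Powell method; the starting point, the step size of 0.001, and the step counts are assumptions picked for illustration.

```python
def rosenbrock(x, y):
    # Standard two-dimensional Rosenbrock function; its minimum is 0 at (1, 1).
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2


def rosenbrock_gradient(x, y):
    # Partial derivatives of the function above, worked out by hand.
    df_dx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    df_dy = 200 * (y - x ** 2)
    return df_dx, df_dy


def descend(start, learning_rate, steps):
    """Run a fixed number of plain gradient descent steps."""
    x, y = start
    for _ in range(steps):
        gx, gy = rosenbrock_gradient(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


start = (-1.2, 1.0)  # an assumed starting point on the far side of the valley
print("cost at start:", rosenbrock(*start))
for steps in (100, 1000, 10000, 50000):
    x, y = descend(start, learning_rate=0.001, steps=steps)
    print(steps, "steps -> cost", rosenbrock(x, y))
```

The cost drops quickly at first and then creeps down along the curved valley floor, so extra accuracy is paid for with many more steps. A method that reaches the minimum in fewer steps may do more work per step, which is one reason “best” depends on what you are willing to spend.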