“Walking down a terraced hillside to the village below, in the dark, and somewhat drunk.” This vivid and humorous analogy, used by Anil Ananthaswamy in his book *Why Machines Learn*, effectively illustrates the concept of stochastic (meaning a bit random) gradient descent (SGD), a fundamental technique in training neural networks.
Reflecting on SGD, I realized that mathematical and machine-learning ideas like this can offer profound insights into life itself. Let me explain:
In the analogy, the village represents an optimal point in a complex, multi-dimensional space: the point where your model makes the fewest mistakes on the task it's trained to perform. The goal of training is to locate this point within a vast and intricate landscape.
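For the technically curious, “walking downhill” translates to repeatedly stepping against the gradient of a loss function. Here is a minimal sketch of plain (non-stochastic, sober) gradient descent; the quadratic loss and the village's coordinates are invented purely for illustration:

```python
import numpy as np

VILLAGE = np.array([3.0, -2.0])  # hypothetical location of the loss minimum

def loss(theta):
    # A made-up "landscape": squared distance from the village.
    return np.sum((theta - VILLAGE) ** 2)

def gradient(theta):
    # Gradient of the loss above; it points uphill, so we step against it.
    return 2 * (theta - VILLAGE)

theta = np.array([0.0, 0.0])  # starting position on the hillside
learning_rate = 0.1           # how big each step is

for _ in range(50):
    theta -= learning_rate * gradient(theta)  # one step downhill

print(theta)  # ends up very close to the village at (3, -2)
```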
Let’s forget the “drunk” part of the analogy for a moment and instead imagine you need to find this village using a lengthy and cryptic set of instructions.
Would you prefer:
- Receiving all the instructions at once, so that processing them takes considerable time and you can move only 1 km per day?
- Getting random parts of the list every hour, letting you move faster, say 1 km per hour, but with less certainty about your direction, especially at first?
Here's the revelation: if the village is nearby, the first approach might suffice. However, if it's far away, perhaps in another country or on another continent, the second method is far more effective.
By choosing the second approach, you might appear lost, often heading in the wrong direction and frequently changing course. Yet, with each hour, you gain more information, refining your understanding and gradually figuring out the correct path.
In my analogy, the list of instructions, as you may have already guessed, represents the data. You are the model, learning to minimize errors by navigating toward the village.
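To make that mapping concrete, here is a small sketch contrasting the two ways of reading the instructions on a toy linear-regression problem. Everything here, the data, batch size, step counts, and learning rate, is invented for illustration; the first loop is full-batch gradient descent, the second is mini-batch SGD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the "list of instructions". The true weights are what we hope to recover.
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(10_000, 2))
y = X @ true_w + rng.normal(scale=0.1, size=10_000)

def grad(w, Xb, yb):
    # Gradient of the mean-squared error on the batch (Xb, yb).
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Approach 1: full-batch gradient descent -- read everything, then take one step.
w = np.zeros(2)
for _ in range(100):
    w -= 0.1 * grad(w, X, y)  # one careful step per full pass over the data

# Approach 2: SGD -- grab a random handful of instructions, step immediately.
w_sgd = np.zeros(2)
for _ in range(100):
    idx = rng.integers(0, len(X), size=32)       # random mini-batch
    w_sgd -= 0.1 * grad(w_sgd, X[idx], y[idx])   # a quicker, noisier step

print(w, w_sgd)  # both should land near true_w = [2, -1]
```

Note the asymmetry in cost: each of the 100 full-batch steps reads all 10,000 examples, while each SGD step reads only 32. That is the code-level version of moving 1 km per hour instead of 1 km per day.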
In practice, models are employed for problems too complex for traditional algorithms, which means the village is always distant and sits in a high-dimensional space beyond our visual imagination.
I see SGD as a compelling demonstration of how to manage complexity: when resources, meaning time to think and act, are limited, complex problems are best tackled by taking regular, swift action, even without complete information, whether the problem is learning a new skill, building something new, or something more philosophical, like understanding yourself.
As many have already said, it's about learning by doing, making mistakes, and sometimes looking less than brilliant, or even “drunk”.