Gradient Descent: A Beginner-Friendly Guide to How Models Learn

Most modern ML models, from simple regressions to deep neural networks, learn using the same core idea: gradient descent. It is an optimization method that gradually adjusts model parameters to minimize the error, much like walking downhill until you reach the lowest point of a valley.

Rolling a ball down the hill – an analogy

Imagine this: you are standing on a hazy mountain and want to reach the bottom of the valley, but you can only see the ground right around your feet. You can feel which direction slopes downward, take a small step that way, and then repeat the process. This is exactly what gradient descent does: at every step, it calculates the “slope” of the loss function and moves the model parameters in the direction that reduces the loss.

In this analogy:

  • The landscape = the loss function (the model’s error).
  • Your position = the current model parameters (the weights).
  • The slope under your feet = the gradient.
  • Each step downhill = one gradient descent update.

By repeating these small steps again and again, you get closer and closer to the minimum of the loss, which improves the model’s predictions.

Core update rule (without heavy math)

The goal of gradient descent is to minimize a loss function L(θ), where θ denotes the model parameters. Gradient descent updates the parameters using the rule

θ ← θ − η · ∇L(θ)

where:

  • ∇L(θ) is the gradient, i.e. the direction of steepest increase of the loss.
  • η (eta) is the learning rate, which controls how big each step is.

Because the gradient points uphill, subtracting it moves the parameters downhill, reducing the loss over time.
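As a minimal sketch of this rule in code (using a toy quadratic loss chosen purely for illustration), the whole loop fits in a few lines:

```python
def loss(theta):
    # Toy quadratic loss with its minimum at theta = 3.
    return (theta - 3.0) ** 2

def gradient(theta):
    # Derivative of the loss above: d/dtheta (theta - 3)^2 = 2 * (theta - 3).
    return 2.0 * (theta - 3.0)

eta = 0.1      # learning rate
theta = 0.0    # arbitrary starting point

for step in range(50):
    theta = theta - eta * gradient(theta)   # the core update rule

print(theta)   # close to 3.0, the minimizer of the loss
```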

Different types of gradient descent

In practice, there are three common variants of gradient descent, which differ in how much data they use per update step.

  • Batch Gradient Descent
    • Uses the complete training dataset to compute the gradient for each update.
    • Pros: Smooth convergence, stable path.
    • Cons: Slow and memory-heavy for large datasets.
  • Stochastic Gradient Descent (SGD)
    • Uses one training example at a time to update parameters.
    • Pros: Very fast updates, and the noise can help escape shallow local minima.
    • Cons: The path is noisy and the loss curve zigzags instead of decreasing smoothly.
  • Mini-batch Gradient Descent
    • Uses small batches (e.g., 32, 64, or 128 data points) for each update.
    • Pros: A good balance between speed and stability; this is usually the default choice in deep learning.

Choosing the right variant depends on the dataset size and the available hardware, but mini-batch gradient descent is the most common choice in real-world training; a minimal mini-batch loop is sketched below.
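The following sketch shows what such a loop can look like, assuming a linear model trained with mean squared error on a synthetic dataset (the data and the `eta` and `batch_size` values are illustrative, not prescriptive):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))               # synthetic features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)     # parameters to learn
eta = 0.05          # learning rate
batch_size = 32     # mini-batch size

for epoch in range(20):
    perm = rng.permutation(len(X))           # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on this mini-batch: 2/n * X^T (X w - y)
        grad = 2.0 / len(Xb) * Xb.T @ (Xb @ w - yb)
        w -= eta * grad

print(w)   # approximately [2.0, -1.0, 0.5]
```

Setting `batch_size` to 1 recovers stochastic gradient descent, and setting it to `len(X)` recovers batch gradient descent.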

The learning rate (LR)

The learning rate η is one of the most important hyperparameters in gradient descent.

  • Too high an LR: the algorithm takes huge steps, overshoots the minimum, and may diverge (the loss explodes instead of decreasing).
  • Too low an LR: the algorithm takes tiny steps, converges very slowly, and may get stuck in flat regions.

A practical approach is to start with a moderate learning rate (e.g., 0.01 or 0.001) and monitor the loss curve over the training epochs. If the loss jumps around or increases significantly, reduce the learning rate; if it decreases very slowly, consider increasing it slightly or using a learning rate schedule.
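To see the difference concretely, here is a tiny sketch on the same toy quadratic loss used earlier (the learning-rate values are arbitrary examples):

```python
def gradient(theta):
    # Gradient of the toy loss (theta - 3)^2.
    return 2.0 * (theta - 3.0)

def run(eta, steps=20):
    theta = 0.0
    for _ in range(steps):
        theta -= eta * gradient(theta)
    return theta

print(run(0.1))    # converges towards 3.0
print(run(0.001))  # still far from 3.0 after 20 steps: too slow
print(run(1.1))    # diverges: every step overshoots further than the last
```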

Adaptive optimizers (for example Adam, RMSprop, and AdaGrad) automatically adjust the effective learning rate for each parameter, which often improves convergence in deep networks.
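As an illustration of the idea rather than a production implementation, a bare-bones Adam-style update might look like this in NumPy (the default hyperparameters follow the values commonly quoted for Adam):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running averages of the gradient (m) and of the squared gradient (v).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the first few steps (t is the 1-based step count).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter step size: directions with large squared gradients get smaller steps.
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

In practice you would normally rely on a framework’s built-in implementation (for example `torch.optim.Adam` in PyTorch) rather than writing this by hand.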

Simple real-world examples

A few scenarios that make gradient descent more tangible:

  • Linear regression for house prices
    • Inputs: Size, number of rooms, location features.
    • Output: Price.
    • Gradient descent adjusts the weight on each feature so that the predicted prices get closer to the actual prices, minimizing the mean squared error (MSE).
  • Logistic regression for spam detection
    • Inputs: Email features (e.g., word frequencies, presence of certain terms).
    • Output: Spam or not spam.
    • Gradient descent optimizes the parameters so that the model classifies emails correctly, minimizing a classification loss such as cross-entropy (a sketch of this case follows the list).
  • Convolutional neural network for image classification
    • Inputs: Pixel values.
    • Output: Class label (e.g., cat vs dog).
    • Backpropagation computes the gradients layer by layer, and gradient descent updates millions of weights to reduce the overall loss (error).

Even though the models differ, the optimization engine underneath is still gradient descent (or a variant of it).
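To make the spam-detection example concrete, here is a hedged sketch of logistic regression trained with plain batch gradient descent; the dataset is synthetic and the feature dimension is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))      # e.g., word-frequency features
true_w = np.array([1.5, -2.0, 0.7, 0.0])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Labels sampled from the logistic model so that they are learnable.
y = (rng.random(500) < sigmoid(X @ true_w)).astype(float)

w = np.zeros(4)
eta = 0.1

for step in range(2000):
    p = sigmoid(X @ w)             # predicted probability of "spam"
    # Gradient of the mean cross-entropy loss: 1/n * X^T (p - y)
    grad = X.T @ (p - y) / len(X)
    w -= eta * grad

print(w)   # roughly recovers the direction of true_w
```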

Common pitfalls of gradient descent and how to avoid them

When starting out with gradient descent, watch out for these issues:

  • Poor scaling of features
    • When features are measured on very different scales (e.g., age vs. income), the loss surface becomes skewed, which slows convergence.
    • Fix: Apply normalization or standardization to the features (a small standardization sketch follows this list).
  • Getting stuck or oscillating
    • The loss does not decrease smoothly, or it gets stuck on plateaus.
    • Fix: Try a different learning rate, mini-batches, momentum, or an adaptive optimizer like Adam.
  • Overfitting
    • Even with good optimization, the model may overfit the training data.
    • Fix: Use regularization (L2, dropout), early stopping, or more data; gradient descent will then optimize a regularized loss that generalizes better.
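For the feature-scaling pitfall above, a minimal standardization sketch in NumPy (the example numbers are made up):

```python
import numpy as np

def standardize(X):
    # Give every feature zero mean and unit variance so that no single
    # feature dominates the gradient just because of its scale.
    return (X - X.mean(axis=0)) / X.std(axis=0)

# e.g., [age, income]: wildly different scales before standardization.
X = np.array([[25.0,  40_000.0],
              [32.0,  85_000.0],
              [47.0, 120_000.0]])

X_scaled = standardize(X)
print(X_scaled.mean(axis=0))   # approximately [0, 0]
print(X_scaled.std(axis=0))    # approximately [1, 1]
```

Libraries such as scikit-learn expose the same transformation as `StandardScaler`.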
