The Promise: Why We Love Early Stopping
First, a quick refresher on why early stopping is in every ML engineer’s toolkit. When you train a deep learning model, its performance on the training data will almost always improve with more epochs. Its performance on new, unseen data, which is what the validation set stands in for, is a different story. At first, validation performance improves, but eventually the model starts to memorize the noise and quirks of the training data. This is overfitting, and it's the point where the model becomes less useful in the real world. Early stopping is the simple, brilliant solution: monitor the model’s performance on a validation set and stop training when that performance stops improving. It spares you countless unnecessary epochs, which saves time, money, and compute. It feels automatic, like a set-it-and-forget-it safeguard against one of machine learning’s biggest pitfalls. On the surface, what could possibly go wrong?
The Common Misunderstanding
Here’s how most engineers envision the process. The model trains, epoch after epoch. The validation loss is tracked. At epoch 50, let’s say, the model hits its lowest validation loss yet—its peak performance. Then, at epoch 51, the loss is a little worse. The assumption is that early stopping immediately slams the brakes, and the model you’re left with is the stellar one from epoch 50. This is the logical conclusion, but it’s not how most standard implementations actually work by default.
Instead, they incorporate a concept called “patience.” The patience parameter tells the training loop not to stop at the first sign of trouble. It waits for a specified number of epochs (e.g., a patience of 10) to see if the model can recover and find an even better minimum. This is a good thing; it prevents stopping prematurely due to random fluctuations. But it also creates a subtle trap.
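To make the patience idea concrete, here is a minimal sketch in plain Python. The `PatienceCounter` class and the loss values are made up for illustration and are not taken from any particular library; real callbacks usually add options such as a minimum improvement threshold.

```python
class PatienceCounter:
    """Counts consecutive epochs without a new best validation loss."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, val_loss):
        """Record one epoch's loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0  # a new best resets the counter
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Hypothetical validation losses: a brief bump at epoch 3, then a real recovery.
losses = [0.90, 0.70, 0.75, 0.60, 0.62, 0.63, 0.64]

counter = PatienceCounter(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if counter.update(loss):
        stopped_at = epoch
        break
```

With a patience of 3, the bump at epoch 3 is tolerated and the run goes on to find the lower loss at epoch 4. A loop that stopped at the first worse epoch would never have seen it.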
The Hidden Detail: The Patience Trap
This is the detail that gets skipped: when early stopping is finally triggered after the 'patience' period runs out, the model you have is not the one from its best-performing epoch. It's the model from the *final* epoch of training. Let’s revisit our example. The best model was at epoch 50. With a patience of 10, the training continues. The validation loss gets a little worse at 51, 52, and so on. At epoch 60, the patience runs out, and training halts.
The model state you have in memory is the one from epoch 60: a model that has spent 10 more epochs drifting into overfitting and that scores worse on the validation set than the one from epoch 50. You correctly identified that training should stop, but you’ve inadvertently kept the overtrained artifact instead of the peak-performance version. You used a brilliant technique to find the best model, only to throw it away at the last second.
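The epoch 50 versus epoch 60 scenario can be reproduced with a toy loop. This is only a sketch: the U-shaped `validation_loss` curve is invented so that its minimum lands on epoch 50, and `model_state` is a stand-in for real weights.

```python
def validation_loss(epoch):
    # Hypothetical U-shaped curve with its minimum at epoch 50.
    return 0.2 + 0.0001 * (epoch - 50) ** 2


patience = 10
best_loss = float("inf")
best_epoch = None
epochs_without_improvement = 0
model_state = None  # stand-in for the weights currently in memory

for epoch in range(1, 201):
    # "Training" overwrites whatever state the model held before.
    model_state = {"trained_through_epoch": epoch}
    loss = validation_loss(epoch)
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # patience exhausted, nothing is rolled back

final_epoch = model_state["trained_through_epoch"]
final_loss = validation_loss(final_epoch)
print(f"best epoch: {best_epoch}, epoch left in memory: {final_epoch}")
```

The loop knows which epoch was best, but the state it is left holding comes from the last epoch it ran, because nothing in the loop ever goes back for the earlier weights.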
The Fix: Restoring the Best Weights
Thankfully, the fix is usually just a single line of code or a simple boolean flag. Most modern machine learning libraries have a built-in answer to this exact problem. In Keras, the `EarlyStopping` callback has an argument called `restore_best_weights`, which defaults to `False`. When you set `restore_best_weights=True`, the callback keeps a copy of the model weights from the epoch with the best value of the monitored metric, and when early stopping ends training it restores those peak-performance weights to your model. PyTorch Lightning takes a different route: its `EarlyStopping` callback only decides when to stop, and you recover the best weights by pairing it with the `ModelCheckpoint` callback and loading the checkpoint it recorded as best.
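In Keras, the change looks roughly like the fragment below. It assumes Keras is installed and importable as `keras`; the commented-out `model.fit` call and its variable names are placeholders, not code from a real project.

```python
import keras

# Stop after 10 epochs without a new best val_loss, then roll the model
# back to the weights from the best epoch instead of keeping the last ones.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,  # the default is False
)

# Hypothetical usage; `model` and the data arrays are assumed to exist:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stopping])
```

The `patience=10` value simply mirrors the running example; the right number depends on how noisy your validation curve is.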
If you’re not using a framework with this feature, the manual process is to save your model’s weights to disk every time a new best validation score is achieved. When training concludes, you simply load the last saved checkpoint. It’s an extra step, but it’s the difference between deploying a model that is truly optimized and one that is just 'good enough'—or worse, actively overfitted.
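The manual version can be sketched in a few lines of plain Python. Everything here is a toy stand-in: `train_one_epoch`, `validation_loss`, and the single-weight "model" are hypothetical, and `pickle` takes the place of whatever save and load functions your framework provides.

```python
import os
import pickle
import tempfile


def train_one_epoch(weights):
    # Hypothetical stand-in for a real training step: nudges a single weight.
    return {"w": weights["w"] + 1.0}


def validation_loss(weights):
    # Hypothetical loss, minimised at w = 5.
    return (weights["w"] - 5.0) ** 2


checkpoint_path = os.path.join(tempfile.mkdtemp(), "best_weights.pkl")
patience = 3
best_loss = float("inf")
epochs_without_improvement = 0
weights = {"w": 0.0}

for epoch in range(1, 101):
    weights = train_one_epoch(weights)
    loss = validation_loss(weights)
    if loss < best_loss:
        best_loss = loss
        epochs_without_improvement = 0
        # New best: overwrite the checkpoint on disk.
        with open(checkpoint_path, "wb") as f:
            pickle.dump(weights, f)
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break

# Training is over: reload the best checkpoint rather than using `weights`.
with open(checkpoint_path, "rb") as f:
    best_weights = pickle.load(f)
```

Because the file is only overwritten on a new best, the checkpoint on disk always holds the best weights seen so far, no matter how many epochs the patience window adds afterwards.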