Let’s imagine you’re doing research on an ideal rental property. You gather your data, open up your favorite programming environment, and get to work performing Exploratory Data Analysis (EDA). During your EDA, you find some dirty data and clean it before training. You decide on a model, separate the data into training, validation, and test sets, and train your model on the cleaned data. Upon evaluating your model on the validation and test data, you notice that your validation error is very high, and so is your test error.
Now suppose you pick a different model or add additional features. Your validation error is now much lower. Great! However, when you evaluate on your test data, you notice that the error is still high. What just happened?
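The workflow above can be sketched in a few lines. This is a minimal illustration using numpy with made-up rental numbers (the square-footage and rent figures, the 60/20/20 split, and all variable names are assumptions for the example, not from any real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rental dataset: 100 rows of (square footage, monthly rent).
sqft = rng.uniform(400, 2000, size=100)
rent = 0.8 * sqft + rng.normal(0, 50, size=100)

# Shuffle, then split 60/20/20 into training, validation, and test sets.
idx = rng.permutation(len(sqft))
train_idx, val_idx, test_idx = idx[:60], idx[60:80], idx[80:]

X_train, y_train = sqft[train_idx], rent[train_idx]
X_val, y_val = sqft[val_idx], rent[val_idx]
X_test, y_test = sqft[test_idx], rent[test_idx]

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The validation set is used to compare candidate models, and the test set is held back until the very end, which is why the two scenarios above can disagree.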
Overfitting vs Underfitting
The above cases resulted in inaccurate models.
In the first case, you ended up with a high validation error and a high test error. This case is known as underfitting, also known as having high bias. An underfit model is too simple to capture the patterns in the data, so it effectively ignores them.
In the second case, you ended up with a low validation error and a high test error. This case is known as overfitting, also known as having high variance. An overfit model is too flexible: it tries to correctly predict every single example it was given for training, noise included, and so fails to generalize.
The image below gives a visual understanding of underfitting and overfitting.
Why are both extremes bad?
Unsurprisingly, we want to prevent both of these cases from happening. Each of them, however, is bad for a different reason.
In the case of underfitting, our model doesn’t accurately reflect our data, so when we try to predict future data we’ll end up with incorrect results. Underfitting is often due to training on too few features or choosing a model that cannot capture complex data, for example using linear regression against data that is not linear. Of the two problems, underfitting is usually easier to deal with, since you will typically notice quickly that your model is inaccurate.
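The linear-regression-on-non-linear-data failure is easy to demonstrate. Here is a small sketch on synthetic data (the quadratic curve, noise level, and variable names are assumptions for the example), comparing the training error of a straight-line fit against a quadratic fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Clearly non-linear data: y = x^2 with a little noise.
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(0, 0.1, size=50)

# Underfit: a straight line cannot capture the curvature.
line = np.polyfit(x, y, deg=1)
line_err = np.mean((np.polyval(line, x) - y) ** 2)

# A quadratic model matches the data-generating process.
quad = np.polyfit(x, y, deg=2)
quad_err = np.mean((np.polyval(quad, x) - y) ** 2)

print(line_err > quad_err)  # True: the line's error is far larger
```

The straight line misses badly even on its own training data, which is the telltale sign of underfitting.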
In the case of overfitting, our model does accurately reflect our training data. However, it no longer captures the generality of the dataset, so we can end up with models that handle future data poorly. Overfitting is typically caused by an overly complex model. Unfortunately, detecting overfitting isn’t always easy: a model can look good right up until it fails on new data.
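Overfitting can also be sketched on synthetic data. In this hypothetical example (the linear trend, noise level, and polynomial degrees are assumptions), a degree-14 polynomial threads through all 15 noisy training points, while held-out points between them expose how poorly it generalizes compared to a simple line:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy linear data on an evenly spaced training grid; the held-out
# test points sit between the training points.
x_train = np.linspace(-1, 1, 15)
y_train = 2 * x_train + rng.normal(0, 0.3, 15)
x_test = (x_train[:-1] + x_train[1:]) / 2
y_test = 2 * x_test + rng.normal(0, 0.3, 14)

def mse(coeffs, xs, ys):
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # generalizes
overfit = np.polyfit(x_train, y_train, deg=14)  # memorizes the noise

print(mse(overfit, x_train, y_train) <= mse(simple, x_train, y_train))
print(mse(overfit, x_test, y_test) > mse(simple, x_test, y_test))
```

The complex model looks better than the line on the training data, which is exactly why overfitting is hard to spot without held-out data.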
How to fix overfitting and underfitting?
Even though getting the perfect model is nearly impossible, there are ways to tweak models to reach a happy medium.
To address underfitting, you can do the following:
Try a different machine learning algorithm (for non-linear data, don’t use linear regression).
Add additional features to train on. This approach will increase your model complexity.
If using regularization, try decreasing your lambda parameter.
In the case of neural networks, create a larger neural network.
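The second remedy above, adding features to increase model complexity, can be sketched as follows. This hypothetical example (synthetic quadratic data; the feature names and noise level are assumptions) adds an x² column to a plain linear model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Curved data that a plain linear model will underfit.
x = np.linspace(-2, 2, 40)
y = x**2 + rng.normal(0, 0.1, 40)

# Original feature set: just [1, x].
X1 = np.column_stack([np.ones_like(x), x])
w1, *_ = np.linalg.lstsq(X1, y, rcond=None)
err1 = np.mean((X1 @ w1 - y) ** 2)

# Added feature: x^2 lets the model follow the curvature.
X2 = np.column_stack([np.ones_like(x), x, x**2])
w2, *_ = np.linalg.lstsq(X2, y, rcond=None)
err2 = np.mean((X2 @ w2 - y) ** 2)

print(err2 < err1)  # True: the richer feature set fits far better
```

Adding the right feature can fix underfitting without switching algorithms at all.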
To address overfitting, you can do the following:
Add more data to your dataset. The additional data should provide a more generalized dataset, thus reducing variance.
Reduce your model’s complexity by removing some of the features used for training.
If already using regularization, increase the lambda parameter.
In the case of neural networks, create a smaller neural network.
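The regularization remedy, increasing lambda, can be sketched with ridge regression, whose closed form makes the effect of lambda explicit. This is an illustrative example on synthetic data (the polynomial features, lambda values, and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Noisy linear data with degree-9 polynomial features:
# flexible enough to overfit 20 points.
x = np.linspace(-1, 1, 20)
y = 2 * x + rng.normal(0, 0.3, 20)
X = np.vander(x, 10)

def ridge(X, y, lam):
    """Closed-form ridge regression: (X^T X + lam*I)^-1 X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

w_small = ridge(X, y, lam=1e-6)  # barely regularized: wild coefficients
w_large = ridge(X, y, lam=1.0)   # stronger penalty: shrunken coefficients

print(np.linalg.norm(w_large) < np.linalg.norm(w_small))  # True
```

A larger lambda shrinks the coefficients toward zero, which constrains the model and pushes it away from the overfitting regime.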
Remember that, as with everything, tweaking your models is a fine balancing act, and you might need to combine several techniques.
Here are a few sources for those who want a more in-depth discussion of overfitting and underfitting: