Striking the right balance is one of the most important parts of building a machine-learning model, and that balance comes from checking for overfitting and underfitting. These issues show up differently depending on the model type: linear regression, tree-based models, and ensemble models can overfit or underfit for different reasons. So let’s dive into the concepts behind model balance and focus on diagnosing and resolving overfitting and underfitting.
What is Bias?
In general, when we create a machine learning model, it analyzes the data, identifies patterns, and makes predictions. During the process, these models learn the patterns from the training data and then apply them to the test dataset for prediction. However, while doing so, we may observe that the model is unable to truly capture the relationship in the training data.
As a result, when we make predictions, there is a difference between the predicted values and the actual expected values. This difference, known as bias or error due to bias, occurs because the model fails to learn the underlying patterns effectively. Ultimately, a model can exhibit either high bias or low bias depending on how well it captures these patterns.
High Bias – When a model makes too many simplifying assumptions and fails to focus on the important features of the dataset, it is a high-bias model. High-bias models also perform poorly on new data: both training and test errors are high and outside the expected range. This leads to the condition of underfitting.
Low Bias – Unlike high-bias models, a low-bias model captures even the noise and small fluctuations in the data. This flexibility can lead to the condition called overfitting: the training error is very low but the test error is quite high, because the model learns noise and training-specific patterns that are not generalised and don’t apply to the test set.
What is Variance?
As the name suggests, variance refers to change – the change in a model’s predictions when it is trained on different sets of training data. A model with high variance does not fit accurately on data it hasn’t seen before, leading to high error rates on the test data despite performing well on the training data (overfitting).
If the model is fairly stable and its predictions do not change much when it is trained on different sets of data, it has low variance, which, together with high bias, shows up as underfitting.
The Bias Variance Trade-off
While making a machine learning model it is very important to keep a perfect balance between bias and variance. This will help to prevent overfitting and underfitting. A simple model with fewer parameters may have low variance and high bias while a model with a large number of parameters will have high variance and low bias. Therefore, achieving a balance between bias and variance errors is necessary, and this balance is known as the Bias-Variance trade-off. To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.
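A compact way to express this trade-off (a standard result for squared-error loss) is:
Total Expected Error = Bias² + Variance + Irreducible Error
A very simple model keeps the variance term small but inflates the squared bias; a very flexible model does the opposite. The sweet spot is wherever their sum is lowest.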
As simple as baking a cake!
Let’s understand the entire concept using a simple example. Imagine baking a cake, where the chef represents the dataset:
Pro Chef – A complex dataset with intricate patterns.
Amateur Chef – A simple dataset with straightforward patterns.
Now let’s say that the ingredients represent the features or the complexity of the model:
More Ingredients – Complex model
Fewer Ingredients – Simple model
Now, based on our analogy, let’s figure out how the cake (the model’s performance) will turn out if we play around with the chef and the ingredients –
Chef (Dataset complexity) | Fewer Ingredients (Simple Model) | More Ingredients (Complex Model)
Pro Chef (Complex dataset) | Doesn’t extract the full potential of the dataset (Underfit) | Fully utilizes the dataset’s potential (Ideal Fit)
Amateur Chef (Simple dataset) | Suitable and decent results (Good fit) | Too complicated to handle, leading to errors (Overfit)
Identifying and Addressing Model Challenges
Linear Models
When a linear model performs exceptionally well on the training set but poorly on the test data, this is often a sign of overfitting. Another telltale sign is unusually large coefficients in the model weights, which indicate over-reliance on specific features.
Example – Suppose a polynomial regression model with a high degree (e.g., degree = 15) is trained on a small dataset. By fitting every fluctuation in the training data, the model memorizes the noise, resulting in poor generalization to new data. To overcome this, apply L1 (Lasso) or L2 (Ridge) regularization to penalize large weights and prevent the model from becoming too complex.
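To make the degree-15 example concrete, here is a minimal sketch. The noisy sine-shaped synthetic data and the exact split are assumptions chosen purely for illustration; with such a flexible model, the training error typically comes out near zero while the test error is far larger.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Synthetic 1-D data: a noisy sine curve (illustrative only)
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Degree-15 polynomial: flexible enough to chase every wiggle of the noise
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)
print("Train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))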
Now, here you can use scikit-learn’s Ridge or Lasso classes:
from sklearn.linear_model import Ridge
# alpha controls the regularization strength: a larger alpha places a stronger penalty on large weights
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
Now, if your model gives high errors on both training and testing and fails to capture the relationship between input and output, it is underfitting. To resolve this, you can reduce the regularization strength, for example by decreasing the alpha parameter in Ridge or Lasso regularization, as shown below.
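As a minimal sketch (the value 0.01 is just an illustration, not a recommendation), weakening the penalty lets the model fit the training data more closely:
from sklearn.linear_model import Ridge
# A smaller alpha means a weaker penalty, allowing larger weights and a more flexible fit
model = Ridge(alpha=0.01)
model.fit(X_train, y_train)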
Tree-Based Models
In models like decision trees, overfitting can be identified when deep trees with multiple branches capture noise in the data. As a result, the training accuracy reaches nearly 100%, while the test accuracy drops significantly. For example, a decision tree that continues splitting the data until each leaf contains a single point may fit the training set perfectly. However, this level of complexity prevents the model from generalizing well to unseen data. Consequently, the model performs poorly on the test set, demonstrating the classic signs of overfitting.
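As a quick illustration (reusing the Breast Cancer dataset that also appears later in this post; the exact numbers will vary), an unconstrained tree usually scores close to 100% on the training set while lagging on the test set:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# With no max_depth, the tree keeps splitting until every leaf is pure, memorizing the training set
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print("Train accuracy:", tree.score(X_train, y_train))  # typically ~1.0
print("Test accuracy:", tree.score(X_test, y_test))     # noticeably lower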
Now, techniques like ensemble methods can be used to rectify this overfitting. A random forest combines multiple trees and reduces variance by averaging over many decorrelated trees. The following code will help you understand the RandomForestClassifier –
from sklearn.ensemble import RandomForestClassifier
# 100 trees, each limited to depth 5, so no single tree can memorize the training data
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
When the tree is too shallow to capture the patterns and both training and testing errors are high, the model is underfitting. To solve this, allow the tree to grow deeper by increasing max_depth, as in the sketch below.
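A minimal sketch (the depth value is an illustrative assumption, not a tuned setting):
from sklearn.tree import DecisionTreeClassifier
# A deeper tree (e.g. max_depth=10 instead of 2 or 3) can model more complex decision boundaries
model = DecisionTreeClassifier(max_depth=10, random_state=42)
model.fit(X_train, y_train)
Next, we will look at ensemble models and see how overfitting and underfitting show up there.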
Ensemble Models
Random forests and gradient boosting are examples of ensemble models. These can overfit when the base learners are too complex or too many trees are added, and the training error becomes significantly lower than the testing error. The fix is straightforward: reduce the depth of the individual trees, and for boosting methods also lower the learning rate. For example, with XGBoost –
from xgboost import XGBRegressor
# Shallow trees (max_depth=4) and a modest learning rate keep each boosting step small
model = XGBRegressor(max_depth=4, learning_rate=0.1, n_estimators=100)
model.fit(X_train, y_train)
On the other hand, to address underfitting you can allow deeper trees and increase the learning rate so the ensemble learns more aggressively, as in the sketch below. Additionally, incorporating feature engineering techniques can enrich the data, thereby improving the model’s ability to learn meaningful patterns.
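As a rough sketch (these parameter values are illustrative, not recommendations), the same XGBRegressor can be made more expressive like this:
from xgboost import XGBRegressor
# Deeper trees and a larger learning rate let each boosting round capture more structure
model = XGBRegressor(max_depth=8, learning_rate=0.3, n_estimators=200)
model.fit(X_train, y_train)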
General tips for checking Overfitting and Underfitting
In general, to assess the model and check for overfitting and underfitting we may use the following methods –
1. Learning Curves
Learning curves help us visualize how a model’s performance changes as a function of the amount of training data (or model complexity). They plot the training error and the validation error against the training set size. Here is a Python example that plots such a curve –
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
# Example model
model = Ridge(alpha=1.0)
# Synthetic regression data for illustration; replace with your own feature matrix X and target y
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
# Create learning curves (scores are negative MSE, so we flip the sign below)
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, scoring="neg_mean_squared_error"
)
# Calculate mean errors
train_errors = -np.mean(train_scores, axis=1)
test_errors = -np.mean(test_scores, axis=1)
# Plot learning curves
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_errors, label="Training Error", color="blue")
plt.plot(train_sizes, test_errors, label="Validation Error", color="orange")
plt.legend()
plt.xlabel("Training Set Size")
plt.ylabel("Error")
plt.title("Learning Curves")
plt.show()
If the validation error is significantly higher than the training error, this indicates that the model is overfitting. On the other hand, if both errors are high and close to each other, this suggests that the model is underfitting. To achieve a well-balanced model, both errors need to converge to a relatively low value.
Learning curves are easiest to interpret when we see them on a real dataset. Here’s how to generate a learning curve plot with a tree-based model using Scikit-learn. This example uses the popular Breast Cancer dataset (built into Scikit-learn) and a Decision Tree Classifier.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Create a Decision Tree Classifier
model = DecisionTreeClassifier(max_depth=5, random_state=42)
# Generate learning curve data
train_sizes, train_scores, test_scores = learning_curve(
model, X, y, cv=5, scoring="accuracy", train_sizes=np.linspace(0.1, 1.0, 10)
)
# Calculate mean and standard deviation for training and test scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
# Plot learning curve
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, label="Training Accuracy", color="blue", marker="o")
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="blue", alpha=0.2)
plt.plot(train_sizes, test_mean, label="Validation Accuracy", color="orange", marker="o")
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, color="orange", alpha=0.2)
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy")
plt.title("Learning Curve: Decision Tree Classifier")
plt.legend(loc="best")
plt.grid()
plt.show()
Upon analyzing this plot, we can draw the following conclusions:
Overfitting – high training accuracy but much lower validation accuracy, leaving a wide gap between the two curves.
Underfitting – both curves converge at a low accuracy, meaning the model is too simple for the data.
Balanced model – both training and validation accuracy are high, with only a minimal gap between them, indicating good generalization.
2. Cross-Validation
Cross-validation is a robust technique for evaluating a model’s performance. It helps detect overfitting or underfitting by splitting the data into multiple training and validation subsets.
Because the model is scored on several different subsets rather than a single train-test split, cross-validation gives a more reliable estimate of how it will perform on unseen data; a quick example is shown below. Read more about cross-validation on our page.
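As a brief sketch (reusing the Breast Cancer data and decision tree from the learning-curve example above), scikit-learn’s cross_val_score runs k-fold cross-validation in a single call:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=5, random_state=42)
# 5-fold CV: train on four folds, validate on the fifth, repeated five times
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
If the fold scores are consistently high, the model generalizes well; large swings between folds point to high variance.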
Conclusion
Balancing overfitting and underfitting is at the heart of building a good machine learning model. Overfitting occurs when a model is too complex and captures noise in the data, while underfitting occurs when a model is too simple to capture the underlying patterns. The key to finding the right balance is the Bias-Variance Trade-off.
Techniques like regularization for linear models, ensemble methods for tree-based models, and cross-validation help manage overfitting and underfitting, while learning curves let us visualize model performance and diagnose fitting issues. Ultimately, the goal is to build a model that generalizes well to unseen data and provides reliable predictions.