Image Classification with CNNs Using Three Architectures of Increasing Complexity
Introduction
Image classification is one of the most important tasks in modern machine learning. It allows computers to automatically recognize and categorize images. Applications include medical imaging, autonomous vehicles, surveillance systems, and many others.
In this article, we will:
- build three CNN architectures of increasing complexity,
- train them on a real image dataset,
- compare their performance,
- analyze how architecture complexity and hyperparameters affect results.
The Data
We’re going to use the CIFAR-10 dataset, which is:
- publicly available,
- commonly used in research,
- composed of 60,000 color images of 32×32 pixels (50,000 for training and 10,000 for testing),
- divided into 10 classes (airplane, automobile, bird, cat, etc.).
The torchvision package can download CIFAR-10 automatically, so no manual download is required.
We’re going to need some libraries, so let’s import them:
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torch.optim as optim
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import torchvision
import torchvision.transforms as transforms
Let’s also define the computation device:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
device(type='cuda')
Neural networks are sensitive to input scale and distribution. Normalization stabilizes training, and augmentation helps prevent overfitting.
Let’s compose some transformations for flipping, cropping, conversion to tensors, and normalization:
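transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(),     # flip left-right with probability 0.5
    transforms.RandomCrop(32, padding=4),  # pad by 4 pixels, then crop a random 32×32 window
    transforms.ToTensor(),                 # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.5,), (0.5,))   # rescale values to [-1, 1]
])
# No augmentation at test time: only tensor conversion and normalization
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])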
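To make the normalization step concrete, here is a minimal sketch in plain Python of the arithmetic applied to each pixel value, assuming a mean and standard deviation of 0.5 as in the transforms above. The normalize helper is a hypothetical name used only for illustration; it is not part of torchvision.

```python
def normalize(value, mean=0.5, std=0.5):
    """Apply the per-value formula (value - mean) / std."""
    return (value - mean) / std


# ToTensor produces values in [0, 1]; after normalization they lie in [-1, 1].
for pixel in (0.0, 0.25, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```

With these settings the darkest pixel maps to -1, mid-gray to 0, and the brightest pixel to 1, so the inputs are centered around zero.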
Now, we can load the dataset and populate the training set and the test set:
trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform_train
)
testset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transform_test
)
100%|██████████| 170M/170M [00:02<00:00, 62.2MB/s]
The Three CNN Architectures
We’re going to use three different neural network architectures with varying levels of complexity:
Architecture 1 – Low Complexity Model
A simple convolutional network characterized by:
- a small number of layers,
- a small number of filters,
- basic pooling layers.
This model serves as a baseline (i.e., a reference point) that shows how far a minimal network can get on this task.
Architecture 2 – Medium Complexity Model
An extended CNN containing:
- more convolutional layers,
- more filters,
- regularization mechanisms (e.g., dropout).
The goal of this architecture is to capture more complex image features at a moderate computational cost.
Architecture 3 – High Complexity Model
A more advanced architecture characterized by:
- deeper network structure,
- larger number of parameters,
- stronger regularization.
This model should have the highest representational capacity but also require greater computational resources and more careful hyperparameter selection.
Let’s now code the three neural networks one by one.
Architecture 1 – Low Complexity Model
The first model is very shallow. It’s also characterized by the lowest computational cost of the three. We’ll use it as a baseline model. Here’s the code:
class CNN_Baseline(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 3×32×32 -> 16×32×32
        self.pool = nn.MaxPool2d(2, 2)                          # 16×32×32 -> 16×16×16
        self.fc = nn.Linear(16 * 16 * 16, 10)                   # one output per class

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = x.view(x.size(0), -1)  # flatten to (batch, features)
        return self.fc(x)
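The input size of the fully connected layer (16 * 16 * 16) follows from how each layer changes the spatial size of the image. The sketch below traces that arithmetic with the standard output-size formula floor((n + 2p - k) / s) + 1; the conv_out helper is a hypothetical name introduced for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                     # CIFAR-10 images are 32x32
size = conv_out(size, kernel=3, padding=1)    # Conv2d(3, 16, 3, padding=1) keeps 32
size = conv_out(size, kernel=2, stride=2)     # MaxPool2d(2, 2) halves it to 16
channels = 16                                 # output channels of the convolution
flattened = channels * size * size
print(size, flattened)
```

The same reasoning gives 64 * 8 * 8 and 128 * 8 * 8 for the two larger models, since each of them applies two pooling layers.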
Architecture 2 – Medium Complexity Model
The second architecture includes two convolutional layers, each followed by a ReLU activation and a max pooling layer. In the classifier, we add a Dropout layer for regularization. Here’s the medium-complexity model:
class CNN_Medium(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
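As a rough illustration of what the Dropout layer does, the following plain-Python sketch implements inverted dropout: during training each activation is zeroed with probability p and the survivors are scaled by 1 / (1 - p), while at evaluation time the values pass through unchanged. The dropout function here is a simplified stand-in written for this article, not PyTorch's implementation.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)               # evaluation mode: identity
    if p >= 1.0:
        return [0.0 for _ in values]      # everything is dropped
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)                    # fixed seed so the sketch is repeatable
activations = [1.0] * 8
print(dropout(activations, p=0.5, training=True, rng=rng))   # mix of 0.0 and 2.0
print(dropout(activations, p=0.5, training=False))           # unchanged
```

This is why the model must be switched between model.train() and model.eval(): the same layer behaves differently in the two modes.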
Architecture 3 – High Complexity Model
Finally, the most complex model. Here, we add a third convolutional layer, widen the layers, insert two batch normalization layers, and raise the dropout rate to 0.6:
class CNN_Complex(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2, 2)
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 * 8 * 8, 512),
            nn.ReLU(),
            nn.Dropout(0.6),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
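To put "increasing complexity" into numbers, the sketch below counts the learnable parameters of the three models by hand from the layer definitions above. The helper functions are hypothetical names introduced for this calculation, and batch normalization is counted as one scale and one shift per channel (its running statistics are not learnable parameters).

```python
def conv_params(in_ch, out_ch, kernel):
    """Kernel weights plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch


def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features


def batchnorm_params(channels):
    """Learnable scale and shift per channel."""
    return 2 * channels


baseline = conv_params(3, 16, 3) + linear_params(16 * 16 * 16, 10)
medium = (conv_params(3, 32, 3) + conv_params(32, 64, 3)
          + linear_params(64 * 8 * 8, 256) + linear_params(256, 10))
complex_ = (conv_params(3, 64, 3) + batchnorm_params(64)
            + conv_params(64, 64, 3)
            + conv_params(64, 128, 3) + batchnorm_params(128)
            + linear_params(128 * 8 * 8, 512) + linear_params(512, 10))
print(baseline, medium, complex_)
```

The counts come to roughly 41 thousand, 1.07 million, and 4.31 million parameters. In the two larger models, most of them sit in the first fully connected layer rather than in the convolutions.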
Training and Evaluation
Let’s create a function to train the model:
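def train_model(model, optimizer, criterion, trainloader, epochs=10):
    model.train()  # training mode: dropout is active, batch norm uses batch statistics
    for epoch in range(epochs):
        running_loss = 0.0  # summed batch losses for this epoch (tracked but not reported)
        for images, labels in trainloader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()              # clear gradients from the previous step
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()                    # backpropagate
            optimizer.step()                   # update the weights
            running_loss += loss.item()
    return model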
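The training loop performs one optimizer step per batch, so batch size and epoch count together determine how many weight updates a model receives. The sketch below works this out for the 50,000 CIFAR-10 training images, assuming the last partial batch is kept (the DataLoader default). The optimizer_steps helper is a hypothetical name used for illustration.

```python
import math

TRAIN_IMAGES = 50_000  # size of the CIFAR-10 training split


def optimizer_steps(batch_size, epochs, dataset_size=TRAIN_IMAGES):
    """Number of optimizer.step() calls, keeping the last partial batch."""
    return math.ceil(dataset_size / batch_size) * epochs


for batch_size in (32, 64):
    for epochs in (5, 10):
        print(batch_size, epochs, optimizer_steps(batch_size, epochs))
```

Halving the batch size roughly doubles the number of updates per epoch, which is one reason batch size, learning rate, and epoch count are hard to evaluate in isolation.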
And another one for evaluating it:
def evaluate_model(model, testloader):
    model.eval()  # evaluation mode: dropout off, batch norm uses running statistics
    all_labels = []
    all_predicted = []
    with torch.no_grad():  # no gradients are needed for inference
        for images, labels in testloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)  # index of the highest logit per image
            all_labels.extend(labels.cpu().numpy())
            all_predicted.extend(predicted.cpu().numpy())
    # Accuracy is reported in percent; the macro-averaged metrics are in [0, 1]
    accuracy = accuracy_score(all_labels, all_predicted) * 100
    precision = precision_score(all_labels, all_predicted, average='macro', zero_division=0)
    recall = recall_score(all_labels, all_predicted, average='macro', zero_division=0)
    f1 = f1_score(all_labels, all_predicted, average='macro', zero_division=0)
    cm = confusion_matrix(all_labels, all_predicted)
    return accuracy, precision, recall, f1, cm
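To show what the macro-averaged metrics measure, here is a plain-Python sketch that derives them from a confusion matrix whose rows are true classes and whose columns are predicted classes. The macro_metrics function and the small 3-class matrix are made up for illustration; they are not results from the experiments.

```python
def macro_metrics(cm):
    """Accuracy and macro precision, recall, and F1 from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        p = tp / predicted_k if predicted_k else 0.0    # mirrors zero_division=0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical 3-class example with 10 samples per class
example_cm = [[9, 1, 0],
              [2, 6, 2],
              [1, 1, 8]]
accuracy, precision, recall, f1 = macro_metrics(example_cm)
print(accuracy, precision, recall, f1)
```

When every class has the same number of samples, as in the CIFAR-10 test set with 1,000 images per class, macro recall equals accuracy. The macro F1-score is the plain average of the per-class F1-scores.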
The Experiments
Let’s run some experiments. We want to compare the three models we defined above using different hyperparameters.
We’ll define the values to try for the learning rate, number of epochs, batch size, and optimizer (SGD or Adam), using one list or dictionary per hyperparameter:
criterion = nn.CrossEntropyLoss()
results = []
learning_rates = [0.001, 0.01]
epochs_list = [5, 10]
batch_sizes = [32, 64]
optimizers = {
    'Adam': optim.Adam,
    'SGD': optim.SGD
}
model_classes = {
    'CNN_Baseline': CNN_Baseline,
    'CNN_Medium': CNN_Medium,
    'CNN_Complex': CNN_Complex
}
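A full grid over these values contains 3 × 2 × 2 × 2 × 2 = 48 combinations. As a quick check, the sketch below enumerates them with itertools.product, using plain name lists in place of the class and optimizer dictionaries. The ordering (model, optimizer, learning rate, epochs, batch size) is an assumption chosen to match nested loops in that order.

```python
from itertools import product

model_names = ['CNN_Baseline', 'CNN_Medium', 'CNN_Complex']
optimizer_names = ['Adam', 'SGD']
learning_rates = [0.001, 0.01]
epochs_list = [5, 10]
batch_sizes = [32, 64]

# The last list varies fastest, just like the innermost loop of a nested for.
grid = list(product(model_names, optimizer_names, learning_rates,
                    epochs_list, batch_sizes))
print(len(grid))
print(grid[0])
print(grid[-1])
```

Each position in this list corresponds to a row index in the results table, which makes it easy to look up which configuration produced a given row.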
For simplicity’s sake, we’ll create a run_experiment function to perform the experiments:
def run_experiment(model_name, model_class, optimizer_name, optimizer_type, lr, epochs, batch_size):
    print(f"Running experiment: Model={model_name}, Optimizer={optimizer_name}, LR={lr}, Epochs={epochs}, BatchSize={batch_size}")
    # Re-create DataLoaders with current batch_size
    current_trainloader = DataLoader(trainset, batch_size=batch_size, shuffle=True)
    current_testloader = DataLoader(testset, batch_size=batch_size, shuffle=False)
    # Instantiate the model and move to device
    model = model_class().to(device)
    # Initialize optimizer (SGD is run with momentum, Adam with its defaults)
    if optimizer_type is optim.SGD:
        optimizer = optimizer_type(model.parameters(), lr=lr, momentum=0.9)
    else:
        optimizer = optimizer_type(model.parameters(), lr=lr)
    # Train model
    trained_model = train_model(model, optimizer, criterion, current_trainloader, epochs=epochs)
    # Evaluate model
    accuracy, precision, recall, f1, cm = evaluate_model(trained_model, current_testloader)
    result_entry = {
        'model_name': model_name,
        'optimizer': optimizer_name,
        'learning_rate': lr,
        'epochs': epochs,
        'batch_size': batch_size,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'confusion_matrix': cm
    }
    return result_entry
Let’s now run the experiments:
print("Starting comprehensive experiments...")
for model_name, model_class in model_classes.items():
    for opt_name, opt_type in optimizers.items():
        for lr in learning_rates:
            for epoch_val in epochs_list:
                for bs in batch_sizes:
                    res = run_experiment(model_name, model_class, opt_name, opt_type, lr, epoch_val, bs)
                    results.append(res)
print("All experiments completed.")
results_df = pd.DataFrame(results)
print("\n--- Experiment Results Summary ---")
print(results_df.head())
Starting comprehensive experiments...
Running experiment: Model=CNN_Baseline, Optimizer=Adam, LR=0.001, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Baseline, Optimizer=Adam, LR=0.001, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Baseline, Optimizer=Adam, LR=0.001, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Baseline, Optimizer=Adam, LR=0.001, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Baseline, Optimizer=Adam, LR=0.01, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Baseline, Optimizer=Adam, LR=0.01, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Baseline, Optimizer=Adam, LR=0.01, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Baseline, Optimizer=Adam, LR=0.01, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Baseline, Optimizer=SGD, LR=0.001, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Baseline, Optimizer=SGD, LR=0.001, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Baseline, Optimizer=SGD, LR=0.001, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Baseline, Optimizer=SGD, LR=0.001, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Baseline, Optimizer=SGD, LR=0.01, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Baseline, Optimizer=SGD, LR=0.01, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Baseline, Optimizer=SGD, LR=0.01, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Baseline, Optimizer=SGD, LR=0.01, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Medium, Optimizer=Adam, LR=0.001, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Medium, Optimizer=Adam, LR=0.001, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Medium, Optimizer=Adam, LR=0.001, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Medium, Optimizer=Adam, LR=0.001, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Medium, Optimizer=Adam, LR=0.01, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Medium, Optimizer=Adam, LR=0.01, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Medium, Optimizer=Adam, LR=0.01, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Medium, Optimizer=Adam, LR=0.01, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Medium, Optimizer=SGD, LR=0.001, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Medium, Optimizer=SGD, LR=0.001, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Medium, Optimizer=SGD, LR=0.001, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Medium, Optimizer=SGD, LR=0.001, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Medium, Optimizer=SGD, LR=0.01, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Medium, Optimizer=SGD, LR=0.01, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Medium, Optimizer=SGD, LR=0.01, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Medium, Optimizer=SGD, LR=0.01, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Complex, Optimizer=Adam, LR=0.001, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Complex, Optimizer=Adam, LR=0.001, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Complex, Optimizer=Adam, LR=0.001, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Complex, Optimizer=Adam, LR=0.001, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Complex, Optimizer=Adam, LR=0.01, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Complex, Optimizer=Adam, LR=0.01, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Complex, Optimizer=Adam, LR=0.01, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Complex, Optimizer=Adam, LR=0.01, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Complex, Optimizer=SGD, LR=0.001, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Complex, Optimizer=SGD, LR=0.001, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Complex, Optimizer=SGD, LR=0.001, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Complex, Optimizer=SGD, LR=0.001, Epochs=10, BatchSize=64
Running experiment: Model=CNN_Complex, Optimizer=SGD, LR=0.01, Epochs=5, BatchSize=32
Running experiment: Model=CNN_Complex, Optimizer=SGD, LR=0.01, Epochs=5, BatchSize=64
Running experiment: Model=CNN_Complex, Optimizer=SGD, LR=0.01, Epochs=10, BatchSize=32
Running experiment: Model=CNN_Complex, Optimizer=SGD, LR=0.01, Epochs=10, BatchSize=64
All experiments completed.
--- Experiment Results Summary ---
     model_name optimizer  learning_rate  epochs  batch_size  accuracy  precision  recall  f1_score
0  CNN_Baseline      Adam          0.001       5          32     59.56   0.599931  0.5956  0.583806
1  CNN_Baseline      Adam          0.001       5          64     59.50   0.603067  0.5950  0.594841
2  CNN_Baseline      Adam          0.001      10          32     62.49   0.632285  0.6249  0.619283
3  CNN_Baseline      Adam          0.001      10          64     63.14   0.633687  0.6314  0.627454
4  CNN_Baseline      Adam          0.010       5          32     39.89   0.417701  0.3989  0.376320
                                    confusion_matrix
0  [[636, 26, 57, 16, 15, 8, 39, 12, 126, 65], [5...
1  [[657, 37, 52, 39, 19, 8, 14, 4, 121, 49], [55...
2  [[622, 29, 19, 28, 24, 2, 8, 7, 214, 47], [45,...
3  [[602, 37, 109, 17, 20, 8, 20, 10, 140, 37], [...
4  [[390, 60, 53, 15, 51, 6, 88, 28, 279, 30], [3...
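The five rows printed above already hint at a learning-rate effect for Adam. As a small sketch, the snippet below groups just those five CNN_Baseline accuracy values by learning rate in plain Python; it covers only 5 of the 48 runs, with a single run at 0.01, so it is a hint to investigate and not a conclusion.

```python
from collections import defaultdict

# (learning_rate, accuracy) for the five CNN_Baseline/Adam rows printed above
rows = [(0.001, 59.56), (0.001, 59.50), (0.001, 62.49), (0.001, 63.14), (0.01, 39.89)]

by_lr = defaultdict(list)
for lr, acc in rows:
    by_lr[lr].append(acc)

mean_accuracy = {lr: sum(accs) / len(accs) for lr, accs in by_lr.items()}
for lr, mean in mean_accuracy.items():
    print(f"lr={lr}: mean accuracy {mean:.2f}% over {len(by_lr[lr])} run(s)")
```

Grouping the full results by learning rate as well as by optimizer would show whether the same pattern holds for the other models.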
The Results
Let’s now visualize and discuss the results:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Define a helper function to plot metrics
def plot_metric_comparison(df, metric, title_suffix):
    plt.figure(figsize=(14, 7))
    sns.barplot(data=df, x='model_name', y=metric, hue='optimizer', errorbar='sd')
    plt.title(f'Average {metric.replace("_", " ").title()} by Model and Optimizer ({title_suffix})')
    plt.ylabel(metric.replace("_", " ").title())
    plt.xlabel('Model Architecture')
    plt.legend(title='Optimizer')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()

# 2. Visualize overall performance by model and optimizer
print("\n--- Visualizing Overall Performance ---")
for metric in ['accuracy', 'precision', 'recall', 'f1_score']:
    plot_metric_comparison(results_df, metric, 'Across All Hyperparameters')

# 3. Create a line plot for F1-score of CNN_Medium vs. Learning Rate, Optimizer, and Batch Size
print("\n--- Visualizing F1-Score of CNN_Medium vs. Hyperparameters ---")
plt.figure(figsize=(15, 6))
sns.lineplot(data=results_df[results_df['model_name'] == 'CNN_Medium'],
             x='learning_rate', y='f1_score', hue='optimizer', style='batch_size',
             marker='o', errorbar='sd')
plt.title('F1-Score of CNN_Medium vs. Learning Rate, Optimizer, and Batch Size')
plt.xscale('log')
plt.ylabel('F1-Score')
plt.xlabel('Learning Rate (log scale)')
plt.grid(True, which="both", ls="--", alpha=0.7)
plt.show()

# 4. Create a line plot for Average Accuracy vs. Number of Epochs Across Models
print("\n--- Visualizing Average Accuracy vs. Epochs ---")
plt.figure(figsize=(12, 6))
sns.lineplot(data=results_df, x='epochs', y='accuracy', hue='model_name', marker='o', errorbar='sd')
plt.title('Average Accuracy vs. Number of Epochs Across Models')
plt.ylabel('Average Accuracy')
plt.xlabel('Epochs')
plt.grid(True, ls="--", alpha=0.7)
plt.show()

# 5. Identify the best performing configuration based on highest accuracy
print("\n--- Best Performing Configuration (Accuracy) ---")
best_accuracy_row = results_df.loc[results_df['accuracy'].idxmax()]
print(best_accuracy_row[['model_name', 'optimizer', 'learning_rate', 'epochs', 'batch_size', 'accuracy', 'precision', 'recall', 'f1_score']])

# 6. Identify the best performing configuration based on highest F1-score
print("\n--- Best Performing Configuration (F1-Score) ---")
best_f1_row = results_df.loc[results_df['f1_score'].idxmax()]
print(best_f1_row[['model_name', 'optimizer', 'learning_rate', 'epochs', 'batch_size', 'accuracy', 'precision', 'recall', 'f1_score']])

# 7. Calculate and print the average performance by grouping by model and optimizer
print("\n--- Average performance by Model and Optimizer ---")
avg_performance = results_df.groupby(['model_name', 'optimizer'])[['accuracy', 'precision', 'recall', 'f1_score']].mean().sort_values(by='f1_score', ascending=False)
print(avg_performance)

# 8. Print the top 5 overall configurations by F1-score
print("\n--- Top 5 Overall Configurations by F1-Score ---")
print(results_df.sort_values(by='f1_score', ascending=False).head()[['model_name', 'optimizer', 'learning_rate', 'epochs', 'batch_size', 'accuracy', 'f1_score']])
--- Visualizing Overall Performance ---
--- Visualizing F1-Score of CNN_Medium vs. Hyperparameters ---
--- Visualizing Average Accuracy vs. Epochs ---
--- Best Performing Configuration (Accuracy) ---
model_name CNN_Complex
optimizer SGD
learning_rate 0.01
epochs 10
batch_size 32
accuracy 76.16
precision 0.77213
recall 0.7616
f1_score 0.757276
Name: 46, dtype: object
--- Best Performing Configuration (F1-Score) ---
model_name CNN_Complex
optimizer SGD
learning_rate 0.01
epochs 10
batch_size 32
accuracy 76.16
precision 0.77213
recall 0.7616
f1_score 0.757276
Name: 46, dtype: object
--- Average performance by Model and Optimizer ---
                        accuracy  precision    recall  f1_score
model_name   optimizer
CNN_Complex  SGD        71.36125   0.722253  0.713612  0.711393
CNN_Medium   SGD        63.54000   0.636093  0.635400  0.629356
CNN_Baseline SGD        57.10750   0.575145  0.571075  0.564336
             Adam       52.38875   0.531430  0.523888  0.514312
CNN_Medium   Adam       40.37125   0.358080  0.403713  0.360856
CNN_Complex  Adam       38.76875   0.347501  0.387688  0.346080
--- Top 5 Overall Configurations by F1-Score ---
     model_name optimizer  learning_rate  epochs  batch_size  accuracy  f1_score
46  CNN_Complex       SGD          0.010      10          32     76.16  0.757276
42  CNN_Complex       SGD          0.001      10          32     75.59  0.756337
47  CNN_Complex       SGD          0.010      10          64     74.54  0.738501
19   CNN_Medium      Adam          0.001      10          64     73.87  0.735358
43  CNN_Complex       SGD          0.001      10          64     73.25  0.729258
Conclusion
As we can see, there’s a visible impact of hyperparameter configurations on model performance:
Model Architecture: Model complexity had a clear effect, but its direction depended on the optimizer. With SGD, the average accuracy, precision, recall, and F1-score all improved from CNN_Baseline to CNN_Medium to CNN_Complex (average F1-scores of 0.564336, 0.629356, and 0.711393). With Adam, the averages moved the other way (0.514312, 0.360856, and 0.346080).
Optimizer: The choice of optimizer played a crucial role. On average, SGD (Stochastic Gradient Descent) with momentum outperformed Adam for every architecture, and the gap widened with model complexity. For CNN_Complex, SGD yielded an average F1-score of 0.711393 compared to Adam’s 0.346080. These averages mix both learning rates, however: in the summary rows shown earlier, CNN_Baseline with Adam fell from roughly 60% accuracy at a learning rate of 0.001 to 39.89% at 0.01, and the best CNN_Medium run (73.87% accuracy) used Adam at 0.001. The gap therefore suggests that Adam was far more sensitive to the higher learning rate in these experiments, not that SGD is better in general.
Learning Rate: The learning rate had a varied impact. For CNN_Medium, the lower learning rate (0.001) with Adam and the larger batch size (64) achieved the highest F1-score of any CNN_Medium configuration (0.735358). For CNN_Complex with SGD, both learning rates appear among the top five configurations, with 0.01 giving the single best result.
Epochs: All of the top five configurations used 10 epochs, and increasing the number of epochs from 5 to 10 generally improved performance, indicating that the models benefited from more training iterations to converge.
Batch Size: The impact of batch size was also tangible. For CNN_Complex with SGD, a batch size of 32 beat 64 by roughly two accuracy points at both learning rates, while the best CNN_Medium run used 64, suggesting that the optimal value is intertwined with the other hyperparameters.
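The per-optimizer averages can hide a wide spread between individual runs. As a worked example using only numbers reported above, the sketch below computes the mean accuracy of the remaining CNN_Medium runs with Adam once the best one is set aside. It assumes 8 runs per model and optimizer pair (2 learning rates × 2 epoch settings × 2 batch sizes).

```python
RUNS_PER_GROUP = 2 * 2 * 2       # learning rates x epoch settings x batch sizes

group_mean_accuracy = 40.37125   # CNN_Medium with Adam, averaged over 8 runs
best_run_accuracy = 73.87        # CNN_Medium, Adam, LR=0.001, 10 epochs, batch size 64

# Total accuracy of the group minus the best run, spread over the other 7 runs
remaining_mean = ((group_mean_accuracy * RUNS_PER_GROUP - best_run_accuracy)
                  / (RUNS_PER_GROUP - 1))
print(f"{remaining_mean:.2f}")
```

The other seven runs average about 35.6% accuracy, so a single group mean describes neither the best nor the typical Adam run for this model.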
Best-Performing Models and Their Optimal Settings:
The overall best-performing configuration was achieved by the CNN_Complex model with the following settings:
- Optimizer: SGD
- Learning Rate: 0.01
- Epochs: 10
- Batch Size: 32
This configuration achieved an accuracy of 76.16% and an F1-score of 0.757276, ranking first on both metrics.
The second best configuration was also CNN_Complex with SGD, LR=0.001, Epochs=10, BatchSize=32, yielding an F1-score of 0.756337.
Interestingly, the CNN_Medium model, with Adam optimizer, LR=0.001, Epochs=10, and BatchSize=64, also showed strong performance, ranking 4th overall in F1-score with 0.735358.
Final Conclusion on the Efficacy of the Three Models:
CNN_Baseline (Low Complexity): This model served well as a baseline, showing reasonable performance, but it was clearly outmatched by the best configurations of its more complex counterparts. It’s suitable for quick prototyping or scenarios with very limited computational resources, but it lacks the capacity to capture intricate features effectively. It was also the least sensitive to the optimizer: its average F1-score ranged from 0.514312 (Adam) to 0.564336 (SGD).
CNN_Medium (Medium Complexity): This model demonstrated a good balance between complexity and performance. It achieved competitive results with certain hyperparameter combinations (the only Adam configuration in the top five was a CNN_Medium model). Its additional layers and regularization (Dropout) likely helped it learn more complex patterns at a moderate computational cost. Its average F1-score ranged from 0.360856 (Adam) to 0.629356 (SGD).
CNN_Complex (High Complexity): As expected, this model delivered the best performance when paired with the SGD optimizer. Its deeper architecture, larger number of parameters, and batch normalization layers gave it the capacity to learn more discriminative features from the CIFAR-10 dataset. However, it was also the most sensitive to hyperparameter choices: with Adam it had the lowest average F1-score of all model and optimizer pairs (0.346080), against 0.711393 with SGD. This highlights its strength in representational capacity but also its vulnerability to inadequate hyperparameter tuning.
In conclusion, in these experiments the CNN_Complex model trained with SGD for 10 epochs was the most effective on CIFAR-10, showing that increased model complexity pays off when the hyperparameters are tuned to match. The combination of optimizer and learning rate was a critical factor: SGD with momentum worked well for CNN_Complex at both learning rates, while Adam’s average results were much weaker across the tested settings.