In this notebook we’ll use PyTorch to build a linear regression model to predict brain weight based on head size.
Data Preparation¶
The analysis will be based on the ‘Brain weight in humans’ dataset by Anubhab Swain available on kaggle.com – https://www.kaggle.com/datasets/anubhabswain/brain-weight-in-humans/data.
The dataset was compiled from a medical study conducted on a group of people.
This dataset consists of 237 records containing information on particular individuals, such as gender, age range, head size, and brain weight. Detailed information on each feature is included below. We’re going to pick just one feature, head size, to see how well we can predict brain weight in relation to it.
Let’s import the libraries we need:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Data Collection¶
Let’s download and view the data:
path = r'brain_weight_in_humans.csv'
df = pd.read_csv(path)
df.head()
|   | Gender | Age Range | Head Size(cm^3) | Brain Weight(grams) |
|---|---|---|---|---|
| 0 | 1 | 1 | 4512 | 1530 |
| 1 | 1 | 1 | 3738 | 1297 |
| 2 | 1 | 1 | 4261 | 1335 |
| 3 | 1 | 1 | 3777 | 1282 |
| 4 | 1 | 1 | 4177 | 1590 |
Data Description¶
Let’s look at basic information about the data:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 0 to 236
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Gender               237 non-null    int64
 1   Age Range            237 non-null    int64
 2   Head Size(cm^3)      237 non-null    int64
 3   Brain Weight(grams)  237 non-null    int64
dtypes: int64(4)
memory usage: 7.5 KB
The DataFrame consists of 237 rows, each of which represents a single person, and 4 columns, each of which represents a single feature.
The data is complete; there are no missing values. All columns are of the integer type (int64).
Let’s have a closer look at the particular columns:
- Gender – 1 represents male, 2 represents female
- Age Range – 1 represents adults (18 or older), 2 represents children
- Head Size(cm^3) – head volume in cubic centimeters
- Brain Weight(grams) – brain weight in grams
The last column, Brain Weight(grams), serves as the target variable for prediction.
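Since Gender and Age Range are stored as integer codes, it's worth a quick sanity check that only the values described above appear. A minimal sketch (output omitted):

# Count how many rows carry each code in the two categorical columns.
print(df["Gender"].value_counts())
print(df["Age Range"].value_counts())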
Data Preprocessing and Cleaning¶
Removing Redundant Rows and Columns¶
There are no redundant rows or columns.
Missing Data¶
There is no missing data:
df.isnull().sum()
|                     | 0 |
|---|---|
| Gender | 0 |
| Age Range | 0 |
| Head Size(cm^3) | 0 |
| Brain Weight(grams) | 0 |
Duplicates¶
There are no duplicates:
df.duplicated().any()
False
Data Transformations¶
We only have numerical data. We don’t need any data transformations.
A Statistical Summary of the Numeric Features¶
Let’s have a look at a statistical summary of the numeric features for the entire dataset:
df.describe()
|   | Gender | Age Range | Head Size(cm^3) | Brain Weight(grams) |
|---|---|---|---|---|
| count | 237.000000 | 237.000000 | 237.000000 | 237.000000 |
| mean | 1.434599 | 1.535865 | 3633.991561 | 1282.873418 |
| std | 0.496753 | 0.499768 | 365.261422 | 120.340446 |
| min | 1.000000 | 1.000000 | 2720.000000 | 955.000000 |
| 25% | 1.000000 | 1.000000 | 3389.000000 | 1207.000000 |
| 50% | 1.000000 | 2.000000 | 3614.000000 | 1280.000000 |
| 75% | 2.000000 | 2.000000 | 3876.000000 | 1350.000000 |
| max | 2.000000 | 2.000000 | 4747.000000 | 1635.000000 |
The data looks reasonable.
Outliers¶
Let’s check if there are any outliers:
plt.figure(figsize=(12, 6))
plt.subplot(3, 1, 1)
sns.boxplot(data=df, x="Gender", orient="h")
plt.subplot(3, 1, 2)
sns.boxplot(data=df, x="Age Range", orient="h")
plt.subplot(3, 1, 3)
sns.boxplot(data=df, x="Head Size(cm^3)", orient="h")
plt.grid()
We can only see outliers in the Head Size(cm^3) column. Let’s count them:
len(df[df["Head Size(cm^3)"] > 4500])
2
There are 2 outliers. We could remove the rows containing them, but they don’t differ that much from the rest of the data, so let’s keep them.
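If we did decide to drop them, one common approach is the 1.5 × IQR rule rather than a hand-picked threshold. A minimal sketch (an assumption on my part; it's not applied anywhere below):

# Compute the interquartile range of head size.
q1 = df["Head Size(cm^3)"].quantile(0.25)
q3 = df["Head Size(cm^3)"].quantile(0.75)
iqr = q3 - q1
# Keep only the rows whose head size lies within 1.5 * IQR of the quartiles.
df_no_outliers = df[df["Head Size(cm^3)"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]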
Data Visualization¶
It’s always easier to understand the data when you see it. Let’s visualize our data then.
Correlations Between Features¶
Let’s have a look at the relationships between the particular features. In particular, we’re interested in how the features correlate with the target feature, which is Brain Weight(grams).
Let’s plot the pairwise relationships between the features first:
sns.pairplot(df)
plt.show()
Next, let’s check the correlations between the features:
df_corr = df.corr()
df_corr
|   | Gender | Age Range | Head Size(cm^3) | Brain Weight(grams) |
|---|---|---|---|---|
| Gender | 1.000000 | -0.088652 | -0.514050 | -0.465266 |
| Age Range | -0.088652 | 1.000000 | -0.105428 | -0.169438 |
| Head Size(cm^3) | -0.514050 | -0.105428 | 1.000000 | 0.799570 |
| Brain Weight(grams) | -0.465266 | -0.169438 | 0.799570 | 1.000000 |
There are both positive and negative correlations. Let’s visualize the correlation matrix as a heatmap:
mask = np.triu(df_corr)
plt.figure(figsize=(10, 4))
plt.title("Correlation Matrix")
sns.heatmap(df_corr,
            cmap='viridis',
            annot=True,
            annot_kws={"size": 7},
            mask=mask,
            linecolor='white',
            linewidth=.5,
            fmt='.3f')
<Axes: title={'center': 'Correlation Matrix'}>
The feature we want to examine the correlations against is Brain Weight(grams). The strongest positive correlation is with Head Size(cm^3) (about 0.80). The correlation with Gender is negative and moderate (about -0.47), and the correlation with Age Range is weak (about -0.17). As mentioned before, we’re going to pick just the head size for further analysis, since it has the strongest correlation with the target. Let’s plot brain weight against head size again:
plt.figure(figsize = (10, 4))
sns.scatterplot(data = df, x = 'Head Size(cm^3)', y = 'Brain Weight(grams)')
plt.grid()
We can clearly see a linear relationship here.
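Seaborn can also overlay a least-squares fit on the same scatter plot, which makes the linear trend even easier to see. A quick sketch using regplot (just for visualization; our actual model will be built in PyTorch):

plt.figure(figsize=(10, 4))
# regplot draws the scatter plot together with a fitted regression line.
sns.regplot(data=df, x='Head Size(cm^3)', y='Brain Weight(grams)', line_kws={'color': 'red'})
plt.grid()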
Inputs and Outputs¶
Head Size(cm^3) is our input; Brain Weight(grams) is our output:
X = df['Head Size(cm^3)']
y = df['Brain Weight(grams)']
Let’s check the input and output shapes:
X.shape, y.shape
((237,), (237,))
So, there are 237 records. We have 1 input for 1 output.
Training Set and Test Set¶
Before we build our model, we should split the data into two separate sets, a training set and a test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
len(X_train), len(X_test), len(y_train), len(y_test)
(189, 48, 189, 48)
We have 189 training samples and 48 testing samples. Now we can start building the model.
Data Scaling¶
The two features we’re interested in, Head Size(cm^3) and the target feature Brain Weight(grams), are on different scales. Let’s have a look:
X_train.head(), y_train.head()
(183    3181
 201    3228
 230    3685
 95     3779
 190    3165
 Name: Head Size(cm^3), dtype: int64,
 183    1175
 201    1235
 230    1350
 95     1165
 190    1237
 Name: Brain Weight(grams), dtype: int64)
To make them more comparable, let’s scale them using StandardScaler:
from sklearn.preprocessing import StandardScaler
# Use separate scalers for the feature and the target so that each keeps its own statistics.
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_train = scaler_X.fit_transform(X_train.values.reshape(-1, 1))
X_test = scaler_X.transform(X_test.values.reshape(-1, 1))
y_train = scaler_y.fit_transform(y_train.values.reshape(-1, 1))
y_test = scaler_y.transform(y_test.values.reshape(-1, 1))
X_train[:5], X_test[:5], y_train[:5], y_test[:5]
(array([[-1.29295141],
        [-1.15994903],
        [ 0.13328686],
        [ 0.39929162],
        [-1.33822882]]),
 array([[-0.39306298],
        [ 0.67378589],
        [-0.16384611],
        [-0.69868546],
        [ 0.41627064]]),
 array([[-0.90152182],
        [-0.39277549],
        [ 0.58232164],
        [-0.98631288],
        [-0.37581728]]),
 array([[-0.05361128],
        [ 1.04867244],
        [ 0.32794847],
        [-1.3254771 ],
        [ 1.00627691]]))
Now the data is rescaled in such a way that the values’ mean is 0 and the standard deviation is 1.
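We can quickly verify that (a small sketch; the training statistics should come out very close to 0 and 1, while the test sets may deviate slightly because they were transformed with the training statistics):

# The training data was used to fit the scalers, so its statistics should match almost exactly.
print(X_train.mean(), X_train.std())
print(y_train.mean(), y_train.std())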
Tensors¶
PyTorch works with tensors, so we have to turn our data into tensors:
import torch
X_train = torch.tensor(X_train).type(torch.float)
X_test = torch.tensor(X_test).type(torch.float)
y_train = torch.tensor(y_train).type(torch.float)
y_test = torch.tensor(y_test).type(torch.float)
# Let's view some samples.
X_train[:5], X_test[:5], y_train[:5], y_test[:5]
(tensor([[-1.2930],
         [-1.1599],
         [ 0.1333],
         [ 0.3993],
         [-1.3382]]),
 tensor([[-0.3931],
         [ 0.6738],
         [-0.1638],
         [-0.6987],
         [ 0.4163]]),
 tensor([[-0.9015],
         [-0.3928],
         [ 0.5823],
         [-0.9863],
         [-0.3758]]),
 tensor([[-0.0536],
         [ 1.0487],
         [ 0.3279],
         [-1.3255],
         [ 1.0063]]))
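A quick check (just a sketch) that the conversion gave us the shapes and data type we expect, namely 189 training rows, 48 test rows, one column each, and float32 throughout:

X_train.shape, X_test.shape, y_train.shape, y_test.shape, X_train.dtype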
Model Building¶
We have the data in place; it’s time to build a model. Besides the model, we’ll define a loss function and an optimizer.
But before that, let’s make our code device agnostic. This is not strictly necessary for a dataset as small as ours, but it’s good practice. This way, we’ll use the GPU if it’s available; otherwise, we’ll use the CPU:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device
'cpu'
In Google Colab, we can change the runtime to GPU in the Runtime menu under Change runtime type. We’re not going to do it here, though.
Defining the Model¶
We now want to build a model that will take our inputs and produce outputs similar to our current outputs. In other words, given the features, the model will predict the labels.
There are a couple of ways of approaching this in PyTorch, but we’ll create the model as a class. If we create a model as a class, we almost always inherit from nn.Module. Then, inside the __init__ method, we create the layers of the neural network. In our case, we only need linear layers.
It’s up to us how many layers we create. It depends on how much space we want to give the model to learn. In any case, we pass two arguments to nn.Linear: in_features, set to the number of inputs to a particular layer, and out_features, set to the number of outputs from a particular layer. We can set these arguments to any numbers we want. We just have to follow these rules:
- In the first layer, we set in_features to the number of inputs to the model.
- In the last layer, we set out_features to the number of outputs from the model.
- In the hidden layers (the layers between the first and last layers), the number of in_features must be equal to the number of out_features in the preceding layer.
To keep it simple, let’s just create one linear layer.
We also have to define a forward method, which will contain the forward pass computation of the model.
In order to inherit from nn.Module and create the layers of the neural network, we have to import nn:
from torch import nn
And now, let’s build the model:
class BrainWeightModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Here we create a linear layer. We have 1 input for 1 output.
        self.layer_1 = nn.Linear(in_features=1, out_features=1)

    # Here we define the method that will compute the forward pass.
    def forward(self, x):
        # The computation will go through the layer.
        return self.layer_1(x)
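The rules listed earlier would let us stack more layers if we wanted to give the model more capacity. Purely as an illustration (a hypothetical variant that isn't used anywhere below), a deeper model could look like this; in practice we would also add nonlinear activations such as nn.ReLU between the layers, since stacked linear layers on their own still amount to a single linear transformation:

# Hypothetical deeper model, not used in the rest of the notebook.
# in_features of each layer matches out_features of the preceding layer.
class DeeperBrainWeightModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=1, out_features=8)   # first layer: 1 model input
        self.layer_2 = nn.Linear(in_features=8, out_features=8)   # hidden layer
        self.layer_3 = nn.Linear(in_features=8, out_features=1)   # last layer: 1 model output

    def forward(self, x):
        return self.layer_3(self.layer_2(self.layer_1(x)))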
Let’s create an instance of the model:
model = BrainWeightModel()
model, model.state_dict()
(BrainWeightModel(
   (layer_1): Linear(in_features=1, out_features=1, bias=True)
 ),
 OrderedDict([('layer_1.weight', tensor([[0.7645]])),
              ('layer_1.bias', tensor([0.8300]))]))
As we can see, weight and bias are set to random values. This model hasn’t been trained yet, so it won’t perform well until we train it.
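Out of curiosity, we could already push the test data through the untrained model; with random weights, the predictions shouldn't match the labels at all (a quick sketch, output omitted):

# Forward pass with the untrained model, just to see what random weights produce.
with torch.inference_mode():
    untrained_preds = model(X_test)
untrained_preds[:5], y_test[:5]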
Now, let’s send the model to the target device so that the code stays device-agnostic: it will run on the GPU if one is available, or on the CPU otherwise:
model.to(device)
next(model.parameters()).device
device(type='cpu')
Next, let’s define the loss function and optimizer.
Loss Function and Optimizer¶
The loss function is used to measure how wrong your model’s predictions are compared to the ground-truth labels (y_train during training and y_test during testing).
The optimizer instructs your model to update its internal parameters to lower the loss.
There are a lot of loss functions in PyTorch we can choose from. For linear regression, some common choices are MAE (mean absolute error) and MSE (mean squared error). We’re going to use the former, which PyTorch provides as torch.nn.L1Loss.
There are also a lot of optimizers. Some common ones are Adam and SGD (stochastic gradient descent). Let’s pick SGD, available as torch.optim.SGD.
The SGD optimizer takes two parameters:
- params – the model’s parameters that we want to optimize,
- lr – the learning rate; the higher it is, the bigger the updates the optimizer makes to the parameters.
We must be careful with the learning rate. It should be neither too high nor too low; otherwise, training will either diverge or barely make progress.
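For reference, had we gone with the other common choices mentioned above, the definitions would look like this (a sketch only; these objects aren't used anywhere below):

# Alternative setup: MSE loss with the Adam optimizer (not used further).
alt_loss_fn = nn.MSELoss()
alt_optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)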
So, here are the loss function and the optimizer:
loss_fn = nn.L1Loss()
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.01)
Now, we’re ready to train the model.
Model Training¶
Training the model involves two loops: a training loop, where the model learns the relationships between the features and labels, and a testing loop, where the model is evaluated. Let’s see what exactly each of the loops contains.
Training Loop¶
As mentioned above, in the training loop, the model goes through the training data and learns how the features are related to the labels.
The steps inside the training loop are:
- Forward pass – the model runs its forward method on all the training data,
- Loss calculation – the model’s predictions are compared to the training labels to see how badly the model performs,
- Gradient zeroing – the optimizer’s gradients are set to zero (by default, they’re accumulated) so that they can be calculated from scratch for this step,
- Backpropagation – the gradient of the loss with respect to each parameter with requires_grad set to True is calculated,
- Gradient descent – the parameters are updated.
Testing Loop¶
The testing loop consists of the following steps:
- Forward pass – the model runs its forward method on all the test data,
- Loss calculation – the model’s predictions are compared to the test labels to see how badly the model performs,
- (optionally) Evaluation metrics – we can calculate regression metrics such as R² or RMSE on the test set; we’re not going to track them here, but a sketch follows below.
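If we did want such a metric, R² can be computed directly from the prediction and label tensors. A minimal sketch (a hypothetical helper, not part of the pipeline below):

# Hypothetical helper: coefficient of determination (R^2) for regression predictions.
def r_squared(y_true, y_pred):
    ss_res = torch.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = torch.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot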
Training the Model¶
Now, let’s implement all these steps in the training and testing loops:
# Seed for reproducibility of random values.
torch.manual_seed(42)
# Train for 1000 epochs.
epochs = 1000
# Put data on the available device
X_train = X_train.to(device)
X_test = X_test.to(device)
y_train = y_train.to(device)
y_test = y_test.to(device)
for epoch in range(epochs):
    ### Training loop
    # Put the model in training mode.
    model.train()

    # The stages as described in the text.
    # 1. Forward pass
    y_pred = model(X_train)
    # 2. Loss calculation
    loss = loss_fn(y_pred, y_train)
    # 3. Gradient zeroing
    optimizer.zero_grad()
    # 4. Backpropagation
    loss.backward()
    # 5. Gradient descent
    optimizer.step()

    ### Testing loop
    # Put the model in evaluation mode.
    model.eval()
    with torch.inference_mode():
        # The stages as described in the text.
        # 1. Forward pass
        test_pred = model(X_test)
        # 2. Loss calculation
        # Predictions are float tensors, and so are the test labels, so they can be compared directly.
        test_loss = loss_fn(test_pred, y_test)

    # Print information every hundredth epoch.
    if epoch % 100 == 0:
        print(f"Epoch: {epoch} | Train Loss (MAE): {loss} | Test Loss (MAE): {test_loss} ")
Epoch: 0 | Train Loss (MAE): 0.8993837237358093 | Test Loss (MAE): 0.7772954106330872
Epoch: 100 | Train Loss (MAE): 0.5219250321388245 | Test Loss (MAE): 0.4605059325695038
Epoch: 200 | Train Loss (MAE): 0.48480406403541565 | Test Loss (MAE): 0.45754313468933105
Epoch: 300 | Train Loss (MAE): 0.48120251297950745 | Test Loss (MAE): 0.46735110878944397
Epoch: 400 | Train Loss (MAE): 0.48098069429397583 | Test Loss (MAE): 0.4684349000453949
Epoch: 500 | Train Loss (MAE): 0.48097139596939087 | Test Loss (MAE): 0.46871528029441833
Epoch: 600 | Train Loss (MAE): 0.48097148537635803 | Test Loss (MAE): 0.46871641278266907
Epoch: 700 | Train Loss (MAE): 0.4809715747833252 | Test Loss (MAE): 0.46871280670166016
Epoch: 800 | Train Loss (MAE): 0.4809720814228058 | Test Loss (MAE): 0.4687314033508301
Epoch: 900 | Train Loss (MAE): 0.48097214102745056 | Test Loss (MAE): 0.46872782707214355
The loss has dropped considerably from its initial value and has essentially plateaued, which is good. Let’s test our model now.
Model Evaluation¶
Let’s see how our trained model performs on test data:
model.eval()
with torch.inference_mode():
    y_pred = model(X_test)
y_pred[:5], y_test[:5]
(tensor([[-0.3116],
         [ 0.5047],
         [-0.1362],
         [-0.5454],
         [ 0.3076]]),
 tensor([[-0.0536],
         [ 1.0487],
         [ 0.3279],
         [-1.3255],
         [ 1.0063]]))
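These numbers are in standardized units, which are hard to interpret. Using the scaler_y fitted on the training labels earlier, we can map everything back to grams and report the mean absolute error in the original units (a sketch, output omitted):

# Undo the standardization so predictions and labels are in grams again.
y_pred_grams = scaler_y.inverse_transform(y_pred.cpu().numpy())
y_test_grams = scaler_y.inverse_transform(y_test.cpu().numpy())
# Mean absolute error in grams.
np.mean(np.abs(y_pred_grams - y_test_grams))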
Let’s plot the training data, test data and predictions:
plt.figure(figsize=(10, 4))
plt.scatter(X_train, y_train, c='b', s=8, label='Training data', alpha=.5)
plt.scatter(X_test, y_test, c='g', s=8, label='Test data', alpha=.5)
plt.scatter(X_test, y_pred, c='r', s=8, label='Predictions')
plt.legend(prop={"size": 14})
plt.grid()
As we can see, the predictions are not perfect, but they are realistic. The model may require some fine-tuning, but we have a working prototype: most of the predictions are pretty close to the test data.
Conclusion¶
The linear regression model we created performs reasonably well. Most of the predictions are close to the test values, so it does its job. Naturally, there’s always room for improvement, but I’ll leave that to you if you feel like playing with it.