
pandas Part 11 – Missing Data


MISSING DATA

In real-life scenarios data is seldom complete. How does pandas handle missing data? Generally, it uses two null values: None and NaN, the latter standing for Not a Number. The problem with None is that it's a Python object (of type NoneType), not a native numeric value. If you use it in a numpy array along with ints, where all elements must be of the same type, the nearest common type for int and None will be object:

In [1]:
import numpy as np
import pandas as pd

nums = np.array([1, 2, 3, None, 4, 5])
nums
Out[1]:
array([1, 2, 3, None, 4, 5], dtype=object)
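
That object dtype comes at a real performance cost. Here's a minimal sketch you can run to see it (the exact timings will, of course, vary from machine to machine):

import numpy as np
from timeit import timeit

int_arr = np.arange(1_000_000)       # native int64 array
obj_arr = int_arr.astype(object)     # the same values, boxed as Python objects

# The object array dispatches a Python-level addition for every element,
# so summing it is typically an order of magnitude slower.
print(timeit(int_arr.sum, number=10))
print(timeit(obj_arr.sum, number=10))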

As the timings show, it takes much more time to process objects than native ints. So, what about the other null value, NaN? NaN is a special IEEE 754 floating-point value, available in numpy as np.nan, so an array containing it can keep the native dtype float64:

In [4]:
nums = np.array([1, 2, 3, np.nan, 4, 5])
nums, nums.dtype
Out[4]:
(array([ 1.,  2.,  3., nan,  4.,  5.]), dtype('float64'))

NaN is also contagious: any arithmetic operation involving NaN yields NaN, whatever the other operand is:

In [5]:
np.nan + 10
Out[5]:
nan
In [6]:
5 * np.nan
Out[6]:
nan
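
One related quirk worth knowing: NaN doesn't even equal itself, so you can't detect it with the == operator. Use np.isnan (or the pandas methods shown below) instead:

print(np.nan == np.nan)   # False - NaN never compares equal, not even to itself
print(np.isnan(np.nan))   # True - the correct way to test for NaN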

What about aggregates? Let’s see. We’re going to use the nums array with the NaN value defined a moment ago:

In [7]:
# minimum value
nums.min()
Out[7]:
nan
In [8]:
# maximum value
nums.max()
Out[8]:
nan
In [9]:
# sum
nums.sum()
Out[9]:
nan

No error is raised, but we don't get anything useful either: the NaN propagates through every aggregate. Fortunately, numpy provides NaN-aware counterparts of these methods that take only the non-NaN values into account:

In [11]:
# minimum value
np.nanmin(nums)
Out[11]:
1.0
In [12]:
# maximum value
np.nanmax(nums)
Out[12]:
5.0
In [13]:
# sum
np.nansum(nums)
Out[13]:
15.0
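
As a side note, once the data lives in a pandas Series, you rarely need these nan* functions: pandas aggregations skip NaN by default. A quick sketch using the same values as in nums:

s = pd.Series([1, 2, 3, np.nan, 4, 5])
print(s.sum())              # 15.0 - NaN is skipped by default
print(s.sum(skipna=False))  # nan  - opt back in to NaN propagation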

TESTING FOR NULL VALUES

In practice, pandas treats None and NaN interchangeably, converting None to NaN where it can. There are a couple of operations we can perform on data with null values. The first thing we can do is test for null values. The two methods we can use to this end are isnull and notnull, with pretty self-explanatory names. They both return a boolean mask over the data. Let's check them out:

In [16]:
# let's create a Series object from the nums array
num_series = pd.Series(nums)

# and now let's use the isnull method
num_series.isnull()
Out[16]:
0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool
In [17]:
# and now the notnull method
num_series.notnull()
Out[17]:
0     True
1     True
2     True
3    False
4     True
5     True
dtype: bool
In [19]:
# and now let's use the masks to index the data with the isnull method
num_series[num_series.isnull()]
Out[19]:
3   NaN
dtype: float64
In [20]:
# and with the notnull method
num_series[num_series.notnull()]
Out[20]:
0    1.0
1    2.0
2    3.0
4    4.0
5    5.0
dtype: float64
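
By the way, pandas also provides isna and notna, which are just aliases of isnull and notnull, so the following check always holds:

# isna/notna are aliases of isnull/notnull
(num_series.isna() == num_series.isnull()).all()   # True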

DROPPING NULL VALUES

You can easily remove null values from a Series object using the dropna method:

In [21]:
num_series.dropna()
Out[21]:
0    1.0
1    2.0
2    3.0
4    4.0
5    5.0
dtype: float64
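
Note that dropna returns a new Series and leaves the original untouched, so reassign the result if you want to keep it:

clean = num_series.dropna()   # num_series itself still contains the NaN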

With a DataFrame the problem is that, by default, a whole row (or, if you choose the other axis, a whole column) is removed if at least one value in it is null. Let's create a simple DataFrame object to demonstrate this:

In [22]:
df = pd.DataFrame([[4, 3, np.nan],
                   [7, 8, 9],
                   [9, np.nan, 7]])

df
Out[22]:
   0    1    2
0  4  3.0  NaN
1  7  8.0  9.0
2  9  NaN  7.0

As you can see, there are NaN values in two rows and in two columns. So, let’s use the dropna method to see how it works:

In [23]:
df.dropna()
Out[23]:
   0    1    2
1  7  8.0  9.0

As you can see, the first and last rows have been removed. This is because the method removes all rows with null values by default. If you want it to remove columns with null values instead, you just have to specify the axis:

In [24]:
df.dropna(axis=1)
Out[24]:
   0
0  4
1  7
2  9
In [25]:
# Alternatively, you can use the explicit name of the axis.
df.dropna(axis='columns')
Out[25]:
   0
0  4
1  7
2  9

If there are lots of rows and columns and quite a few null values, it may happen that most or even all of them get removed, and a lot of valuable data with them. This is why we can use the how parameter (set to 'any' by default) with the value 'all' if we want to remove only those rows or columns where all values are null.

Let’s modify our DataFrame a bit so that the second column contains only null values:

In [26]:
df[1] = np.nan
df
Out[26]:
   0   1    2
0  4 NaN  NaN
1  7 NaN  9.0
2  9 NaN  7.0

Now we’ll remove only the columns where all values are null:

In [27]:
df.dropna(axis='columns', how='all')
Out[27]:
   0    2
0  4  NaN
1  7  9.0
2  9  7.0

There's also the thresh parameter, which works the other way around: it sets the minimum number of non-null values a row or column must contain in order to be kept; anything with fewer is removed. So, here's how we can keep only the rows with at least 2 non-null values, which in our case removes just the first row:

In [28]:
df.dropna(thresh=2)
Out[28]:
   0   1    2
1  7 NaN  9.0
2  9 NaN  7.0
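
The same logic works along the column axis. With our current DataFrame, requiring at least 2 non-null values per column drops the all-NaN middle column:

# column 1 has zero non-null values, so it's removed; columns 0 and 2
# have 3 and 2 non-null values respectively, so they're kept
df.dropna(axis='columns', thresh=2)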

FILLING NULL VALUES

Sometimes you may want to replace null values with some default value like 0, or with an interpolated or repeated value. This is where the fillna and interpolate methods come in handy. Let's have a look at our Series and DataFrame objects again:

In [33]:
num_series
Out[33]:
0    1.0
1    2.0
2    3.0
3    NaN
4    4.0
5    5.0
dtype: float64
In [30]:
df
Out[30]:
   0   1    2
0  4 NaN  NaN
1  7 NaN  9.0
2  9 NaN  7.0

Let’s have a look at these two methods in action:

In [34]:
# Let's fill all null values in the DataFrame with zeros.
df.fillna(0)
Out[34]:
   0    1    2
0  4  0.0  0.0
1  7  0.0  9.0
2  9  0.0  7.0
In [36]:
# Let's propagate the previous values forward in the Series.
num_series.fillna(method='ffill')
Out[36]:
0    1.0
1    2.0
2    3.0
3    3.0
4    4.0
5    5.0
dtype: float64
In [37]:
# Let's propagate the next values backward in the DataFrame.
df.fillna(method='bfill')
Out[37]:
   0   1    2
0  4 NaN  9.0
1  7 NaN  9.0
2  9 NaN  7.0

As you can see in the last example, a value is propagated only if there is a non-NaN value to copy from: column 1 stays empty because it contains no values at all. The same is true of forward fill. Some more examples:

In [38]:
# Let's fill NaN values using the forward fill method and a specific axis.
df.fillna(method='ffill', axis='columns')
Out[38]:
     0    1    2
0  4.0  4.0  4.0
1  7.0  7.0  9.0
2  9.0  9.0  7.0
In [40]:
# And now let's use interpolated values in the Series object.
num_series.interpolate()
Out[40]:
0    1.0
1    2.0
2    3.0
3    3.5
4    4.0
5    5.0
dtype: float64
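
The NaN at index 3 became 3.5 because interpolate performs linear interpolation by default, taking the midpoint of the neighboring values 3.0 and 4.0. Other strategies can be selected with the method parameter, for example (this particular one requires SciPy):

num_series.interpolate(method='nearest')   # snap to the closest non-null neighbor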
In [44]:
# And finally, let's interpolate the values in the DataFrame object along a specified axis (1 is the same as 'columns').
df.interpolate(axis=1)
Out[44]:
     0    1    2
0  4.0  4.0  4.0
1  7.0  8.0  9.0
2  9.0  8.0  7.0
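
One final note: in recent pandas releases (2.1 and later) the method parameter of fillna has been deprecated in favor of the dedicated ffill and bfill methods, which behave exactly like the calls above:

num_series.ffill()   # same result as num_series.fillna(method='ffill')
df.bfill()           # same result as df.fillna(method='bfill')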
