Today we’ll learn how to create structured numpy arrays.
All the arrays we've used so far held homogeneous data, that is, data of a single type, like all ints or all floats. But we can also define structured numpy arrays whose columns have different types. Let’s say we want to implement the following table as an array:
Name | Occupation | Age | Income |
John Smith | dentist | 47 | 30534 |
Mike Turner | driver | 50 | 24130 |
Lisa Steep | teacher | 28 | 29547 |
Jennifer Lee | nurse | 31 | 25478 |
dtype
Using dtype we can define a different data type for each column. As you can see, we’ll need strings for the first two columns and integers for the other two.
First, let’s define a single column, for example the Age column, and populate it. Each row is passed as a one-element tuple. We can do it like so:
import numpy as np
ages = np.array([(47,), (50,), (28,), (31,)],
                dtype=np.dtype([('age', 'i1')]))
# We access the data in the age column by key.
print(ages['age'])
Here’s the output:
[47 50 28 31]
For better readability we could assign the data type to a variable:
age = np.dtype([('age', 'i1')])
ages = np.array([(47,), (50,), (28,), (31,)],
                dtype=age)
# We access the data in the age column by key.
print(ages['age'])
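To see what a type code such as 'i1' actually stands for, we can inspect the dtype itself. The following is a minimal sketch that reuses the age example; the variable name limits is only illustrative, and the checks rely on standard numpy attributes (names, itemsize) and np.iinfo.

```python
import numpy as np

age = np.dtype([('age', 'i1')])
ages = np.array([(47,), (50,), (28,), (31,)], dtype=age)

# The field names defined by the structured dtype.
print(ages.dtype.names)              # ('age',)

# 'i1' is a 1-byte signed integer, i.e. np.int8.
print(ages.dtype['age'] == np.int8)  # True

# Each record occupies one byte because it holds a single 'i1' field.
print(ages.dtype.itemsize)           # 1

# Value range of 'i1'; numbers outside this range would not fit.
limits = np.iinfo(ages.dtype['age'])
print(limits.min, limits.max)        # -128 127
```

A 1-byte integer is enough for ages, but a column with larger values, such as the income in our table, needs a wider type like 'i4'.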
Creating Structured numpy Arrays
Now we’re ready to create the structured array for the table shown at the beginning. It will have four columns, and we’ll populate it with the data from that table.
Here’s the code in which we create and use the structured array:
import numpy as np
# Let's define a data type and assign it to a variable.
# We want the strings to be Unicode strings of up to 20 characters,
# so we use the type code 'U20'. The age is a 1-byte integer ('i1')
# and the income is a 4-byte integer ('i4').
worker = np.dtype([('name', 'U20'),
                   ('occupation', 'U20'),
                   ('age', 'i1'),
                   ('income', 'i4')])
workers = np.array([
    ('John Smith', 'dentist', 47, 30534),
    ('Mike Turner', 'driver', 50, 24130),
    ('Lisa Steep', 'teacher', 28, 29547),
    ('Jennifer Lee', 'nurse', 31, 25478)],
    dtype=worker)
# We can access the data by row, by column or individually.
# If you need a whole row, just use the index of that row.
# So, if you need the second row, you should use the index 1.
print("Second row:")
print(workers[1])
print()
# You can access a whole column by key, like before. Let's access the
# occupation column for example.
print("The occupation column:")
print(workers['occupation'])
print()
# And now let's access the third item in the names column.
print("The third name:")
print(workers['name'][2])
print()
# Or, equivalently, pick the row first and then the field.
print("The third name again:")
print(workers[2]['name'])
If we run this program, we’ll get the following output:
Second row:
('Mike Turner', 'driver', 50, 24130)
The occupation column:
['dentist' 'driver' 'teacher' 'nurse']
The third name:
Lisa Steep
The third name again:
Lisa Steep
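Because a column accessed by key is an ordinary numpy array, it can also be used to build a boolean mask for selecting rows, to update values in place, or as a sort key. Here is a small sketch of that idea using the same workers data. The age threshold and the new income value are made up for illustration, and the names over_40 and by_income are hypothetical.

```python
import numpy as np

worker = np.dtype([('name', 'U20'),
                   ('occupation', 'U20'),
                   ('age', 'i1'),
                   ('income', 'i4')])

workers = np.array([
    ('John Smith', 'dentist', 47, 30534),
    ('Mike Turner', 'driver', 50, 24130),
    ('Lisa Steep', 'teacher', 28, 29547),
    ('Jennifer Lee', 'nurse', 31, 25478)],
    dtype=worker)

# Select the rows where the age is above 40 (hypothetical threshold).
over_40 = workers[workers['age'] > 40]
print(over_40['name'])      # ['John Smith' 'Mike Turner']

# Assigning through a column key changes the original array in place.
workers['income'][2] = 30000
print(workers[2])           # ('Lisa Steep', 'teacher', 28, 30000)

# Sort the rows by a field using np.sort with the order argument.
by_income = np.sort(workers, order='income')
print(by_income['name'])
```

Note that np.sort returns a sorted copy, so the workers array itself keeps its original row order.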
A More Complex Example
And now a slightly more complex example. Let’s implement the following table as a structured numpy array:
Box ID | Size: Length | Size: Width | Size: Height | Content: Item | Content: Amount | Content: Price |
210316 | 49.52 | 40.25 | 11.89 | book | 11 | 128.65 |
211541 | 39.22 | 21.15 | 18.36 | pencil | 480 | 240.00 |
199520 | 51.50 | 19.47 | 15.45 | eraser | 1200 | 400.25 |
We’ll use the following types:
– int32 for the box id
– float64 for length, width, height and price
– int16 for amount
– unicode (30 characters) for item
This time we won’t use the string codes for the data types, but rather the full type names, so np.int32 instead of 'i4', for example. For a Unicode string field we add the length as a third element of the field definition: ('item', np.str_, 30). The np.unicode alias found in older code has been removed from current numpy releases, so we use np.str_.
Here’s our implementation:
import numpy as np
box = np.dtype([('boxID', np.int32),
                ('size', [('length', np.float64),
                          ('width', np.float64),
                          ('height', np.float64)]),
                ('content', [('item', np.str_, 30),
                             ('amount', np.int16),
                             ('price', np.float64)])])
boxes = np.array([
    (210316, (49.52, 40.25, 11.89), ('book', 11, 128.65)),
    (211541, (39.22, 21.15, 18.36), ('pencil', 480, 240.00)),
    (199520, (51.50, 19.47, 15.45), ('eraser', 1200, 400.25))],
    dtype=box)
# last row
print("Last row:")
print(boxes[-1])
print()
# the box id column
print("The box id column:")
print(boxes['boxID'])
print()
# the width column
print("The width column:")
print(boxes['size']['width'])
print()
# the content columns
print("The content columns:")
print(boxes['content'])
print()
# the price in the second row
print("The price in row 2:")
print(boxes['content']['price'][1])
print()
And here’s the output:
Last row:
(199520, (51.5, 19.47, 15.45), ('eraser', 1200, 400.25))
The box id column:
[210316 211541 199520]
The width column:
[40.25 21.15 19.47]
The content columns:
[('book', 11, 128.65) ('pencil', 480, 240. ) ('eraser', 1200, 400.25)]
The price in row 2:
240.0
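The string codes and the full type names describe the same types, which a quick comparison can confirm. The sketch below also shows that a nested field can be reached by picking the row first. It uses a reduced version of the box table with only the content part, and it assumes np.str_ as the string type, which is the name available in current numpy releases.

```python
import numpy as np

# String codes and full type names produce equal dtypes.
print(np.dtype('i4') == np.dtype(np.int32))        # True
print(np.dtype('i2') == np.dtype(np.int16))        # True
print(np.dtype('f8') == np.dtype(np.float64))      # True
print(np.dtype('U30') == np.dtype((np.str_, 30)))  # True

# A reduced nested dtype with only the box id and the content sub-table.
box = np.dtype([('boxID', np.int32),
                ('content', [('item', np.str_, 30),
                             ('amount', np.int16),
                             ('price', np.float64)])])

boxes = np.array([
    (210316, ('book', 11, 128.65)),
    (211541, ('pencil', 480, 240.00))],
    dtype=box)

# Row first, then the nested fields ...
print(boxes[1]['content']['price'])   # 240.0
# ... gives the same value as fields first, then the row.
print(boxes['content']['price'][1])   # 240.0
```

Which order to use is mostly a matter of readability: key-first access is convenient when working with a whole column, and row-first access when working with a single record.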