NumPy Tutorial: Data analysis with Python
NumPy is a commonly used Python data analysis package. By using NumPy, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use NumPy under the hood. NumPy was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages NumPy in some way.
In this tutorial, we’ll walk through using NumPy to analyze data on wine quality. The data contains information on various attributes of wines, such as pH and fixed acidity, along with a quality score between 0 and 10 for each wine. The quality score is the average of at least 3 human taste testers. As we learn how to work with NumPy, we’ll try to figure out more about the perceived quality of wine.
The wines we’ll be analyzing are from the Minho region of Portugal.
The data was downloaded from the UCI Machine Learning Repository, and is available here. Here are the first few rows of the winequality-red.csv file, which we’ll be using throughout this tutorial:
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
The data is in what I’m going to call ssv (semicolon separated values) format – the fields within each record are separated by a semicolon (;), and records are separated by a new line. There are 1600 rows in the file, including a header row, and 12 columns.
Before we get started, a quick version note – we’ll be using Python 3.5. Our code examples will be done using Jupyter notebook.
If you want to jump right into a specific area, here are the topics:
- Creating an Array
- Reading Text Files
- Array Indexing
- N-Dimensional Arrays
- Data Types
- Array Math
- Array Methods
- Array Comparison and Filtering
- Reshaping and Combining Arrays
Lists Of Lists for CSV Data
Before using NumPy, we’ll first try to work with the data using Python and the csv package. We can read in the file using the csv.reader object, which will allow us to read in and split up all the content from the ssv file.
In the below code, we:
- Import the csv library.
- Open the winequality-red.csv file.
- With the file open, create a new csv.reader object.
- Pass in the keyword argument delimiter=";" to make sure that the records are split up on the semicolon character instead of the default comma character.
- Call the list type to get all the rows from the file.
- Assign the result to wines.
import csv
with open("winequality-red.csv", 'r') as f:
    wines = list(csv.reader(f, delimiter=";"))
# print(wines[:3])
headers = wines[0]
wines_only = wines[1:]
# print the headers
print(headers)
['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
# print the 1st row of data
print(wines_only[0])
['7.4', '0.7', '0', '1.9', '0.076', '11', '34', '0.9978', '3.51', '0.56', '9.4', '5']
# print the 1st three rows of data
print(wines_only[:3])
[['7.4', '0.7', '0', '1.9', '0.076', '11', '34', '0.9978', '3.51', '0.56', '9.4', '5'], ['7.8', '0.88', '0', '2.6', '0.098', '25', '67', '0.9968', '3.2', '0.68', '9.8', '5'], ['11.2', '0.28', '0.56', '1.9', '0.075', '17', '60', '0.998', '3.16', '0.58', '9.8', '6']]
The data has been read into a list of lists. Each inner list is a row from the ssv file. As you may have noticed, each item in the entire list of lists is represented as a string, which will make it harder to do computations.
As you can see from the table above, we’ve read in three rows, the first of which contains column headers. Each row after the header row represents a wine. The first element of each row is the fixed acidity, the second is the volatile acidity, and so on.
Calculate Average Wine Quality
We can find the average quality of the wines. The below code will:
- Extract the last element from each row after the header row.
- Convert each extracted element to a float.
- Assign all the extracted elements to the list qualities.
- Divide the sum of all the elements in qualities by the total number of elements in qualities to get the mean.
# calculate average wine quality with a loop
qualities = []
for row in wines[1:]:
    qualities.append(float(row[-1]))
sum(qualities) / len(wines[1:])
5.636420525657071
# calculate average wine quality with a list comprehension
qualities = [float(row[-1]) for row in wines[1:]]
sum(qualities) / len(wines[1:])
5.636420525657071
Although we were able to do the calculation we wanted, the code is fairly complex, and it won’t be fun to have to do something similar every time we want to compute a quantity. Luckily, we can use NumPy to make it easier to work with our data.
NumPy 2-Dimensional Arrays
With NumPy, we work with multidimensional arrays. We’ll dive into all of the possible types of multidimensional arrays later on, but for now, we’ll focus on 2-dimensional arrays. A 2-dimensional array is also known as a matrix, and is something you should be familiar with. In fact, it’s just a different way of thinking about a list of lists. A matrix has rows and columns. By specifying a row number and a column number, we’re able to extract an element from a matrix.
If we picked the element at the first row and the second column, we’d get volatile acidity. If we picked the element in the third row and the second column, we’d get 0.88.
In a NumPy array, the number of dimensions is called the rank, and each dimension is called an axis. So
- the rows are the first axis
- the columns are the second axis
Now that you understand the basics of matrices, let’s see how we can get from our list of lists to a NumPy array.
Creating A NumPy Array
We can create a NumPy array using the numpy.array function. If we pass in a list of lists, it will automatically create a NumPy array with the same number of rows and columns. Because we want all of the elements in the array to be float elements for easy computation, we’ll leave off the header row, which contains strings. One of the limitations of NumPy is that all the elements in an array have to be of the same type, so if we include the header row, all the elements in the array will be read in as strings. Because we want to be able to do computations like find the average quality of the wines, we need the elements to all be floats.
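To see the single-type limitation in action, here is a minimal sketch. It uses two made-up rows shaped like the csv.reader output rather than the real file, and the names rows, with_header, and without_header are illustrative only.

```python
import numpy as np

# Two illustrative rows shaped like csv.reader output: every item is a string
rows = [
    ["fixed acidity", "quality"],  # header row
    ["7.4", "5"],                  # one data row
]

# Including the header forces a string dtype for the whole array
with_header = np.array(rows)
print(with_header.dtype.kind)    # 'U' means Unicode string

# Skipping the header lets NumPy convert every element to a float
without_header = np.array(rows[1:], dtype=float)
print(without_header.dtype)      # float64
print(without_header.mean())     # arithmetic now works on the values
```

With the header included, numeric work such as mean() is not available because every element is text.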
In the below code, we:
- Import the numpy package.
- Pass the list of lists wines into the array function, which converts it into a NumPy array.
- Exclude the header row with list slicing.
- Specify the keyword argument dtype to make sure each element is converted to a float. We’ll dive more into what the dtype is later on.
import numpy as np
np.set_printoptions(precision=2) # set the output print precision for readability
# create the numpy array skipping the headers
# (the builtin float replaces the deprecated np.float alias)
wines = np.array(wines[1:], dtype=float)
# If we display wines, we'll now get a NumPy array:
print(type(wines), wines)
<class 'numpy.ndarray'> [[ 7.4 0.7 0. ... 0.56 9.4 5. ]
[ 7.8 0.88 0. ... 0.68 9.8 5. ]
[11.2 0.28 0.56 ... 0.58 9.8 6. ]
...
[ 6.3 0.51 0.13 ... 0.75 11. 6. ]
[ 5.9 0.65 0.12 ... 0.71 10.2 5. ]
[ 6. 0.31 0.47 ... 0.66 11. 6. ]]
# We can check the number of rows and columns in our data using the shape property of NumPy arrays:
wines.shape
(1598, 12)
Alternative NumPy Array Creation Methods
There are a variety of methods that you can use to create NumPy arrays. To start with, you can create an array where every element is zero. This is useful when you need an array of fixed size, but don’t have any values for it yet. The below code will create an array with 3 rows and 4 columns, where every element is 0, using numpy.zeros:
empty_array = np.zeros((3, 4))
empty_array
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
Creating arrays full of random numbers can be useful when you want to quickly test your code with sample arrays. You can create an array where each element is a random number using numpy.random.rand:
np.random.rand(2, 3)
array([[0.63, 0.05, 0.44],
[0.67, 0.3 , 0.26]])
Using NumPy To Read In Files
It’s possible to use NumPy to directly read csv or other files into arrays. We can do this using the numpy.genfromtxt function. We can use it to read in our initial data on red wines.
In the below code, we:
- Use the genfromtxt function to read in the winequality-red.csv file.
- Specify the keyword argument delimiter=";" so that the fields are parsed properly.
- Specify the keyword argument skip_header=1 so that the header row is skipped.
wines = np.genfromtxt("winequality-red.csv", delimiter=";", skip_header=1)
wines
array([[ 7.4 , 0.7 , 0. , ..., 0.56, 9.4 , 5. ],
[ 7.8 , 0.88, 0. , ..., 0.68, 9.8 , 5. ],
[11.2 , 0.28, 0.56, ..., 0.58, 9.8 , 6. ],
...,
[ 6.3 , 0.51, 0.13, ..., 0.75, 11. , 6. ],
[ 5.9 , 0.65, 0.12, ..., 0.71, 10.2 , 5. ],
[ 6. , 0.31, 0.47, ..., 0.66, 11. , 6. ]])
wines will end up looking the same as if we read it into a list then converted it to an array of floats. NumPy will automatically pick a data type for the elements in an array based on their format.
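If you do not have the file at hand, the same call can be tried on an in-memory stand-in. The sketch below assumes a made-up three-column sample in the same semicolon-separated layout. The names sample and sample_wines are illustrative.

```python
import io
import numpy as np

# A tiny in-memory stand-in for the ssv file (three columns only)
sample = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    "7.4;3.51;5\n"
    "7.8;3.2;5\n"
)

# genfromtxt accepts file-like objects as well as file names
sample_wines = np.genfromtxt(sample, delimiter=";", skip_header=1)
print(sample_wines.shape)   # (2, 3)
print(sample_wines.dtype)   # float64, picked automatically
```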
Indexing NumPy Arrays
We now know how to create arrays, but unless we can retrieve results from them, there isn’t a lot we can do with NumPy. We can use array indexing to select individual elements, groups of elements, or entire rows and columns.
One important thing to keep in mind is that just like Python lists, NumPy is zero-indexed, meaning that:
- The index of the first row is 0
- The index of the first column is 0
- If we want to work with the fourth row, we’d use index 3
- If we want to work with the second row, we’d use index 1, and so on.
We’ll again work with the wines array:
7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
Let’s select the element at row 3 and column 4.
We pass:
- 2 as the row index
- 3 as the column index.
This retrieves the value from the third row and fourth column
wines[2, 3]
1.9
wines[2][3]
1.9
Since we’re working with a 2-dimensional array in NumPy we specify 2 indexes to retrieve an element.
- The first index is the row index (the first axis, which NumPy numbers as axis 0)
- The second index is the column index (the second axis, which NumPy numbers as axis 1)
Any element in wines can be retrieved using 2 indexes.
# rows 1, 2, 3 and column 4
wines[0:3, 3]
array([1.9, 2.6, 1.9])
# all rows and column 3
wines[:, 2]
array([0. , 0. , 0.56, ..., 0.13, 0.12, 0.47])
Just like with list slicing, it’s possible to omit the 0 to just retrieve all the elements from the beginning up to element 3:
# rows 1, 2, 3 and column 4
wines[:3, 3]
array([1.9, 2.6, 1.9])
We can select an entire column by specifying that we want all the elements, from the first to the last. We specify this by just using the colon :, with no starting or ending indices. The below code will select the entire fourth column:
# all rows and column 4
wines[:, 3]
array([1.9, 2.6, 1.9, ..., 2.3, 2. , 3.6])
We selected an entire column above, but we can also extract an entire row:
# row 4 and all columns
wines[3, :]
array([ 7.4 , 0.7 , 0. , 1.9 , 0.08, 11. , 34. , 1. , 3.51,
0.56, 9.4 , 5. ])
If we take our indexing to the extreme, we can select the entire array using two colons to select all the rows and columns in wines. This is a great party trick, but doesn’t have a lot of good applications:
wines[:, :]
array([[ 7.4 , 0.7 , 0. , ..., 0.56, 9.4 , 5. ],
[ 7.8 , 0.88, 0. , ..., 0.68, 9.8 , 5. ],
[11.2 , 0.28, 0.56, ..., 0.58, 9.8 , 6. ],
...,
[ 6.3 , 0.51, 0.13, ..., 0.75, 11. , 6. ],
[ 5.9 , 0.65, 0.12, ..., 0.71, 10.2 , 5. ],
[ 6. , 0.31, 0.47, ..., 0.66, 11. , 6. ]])
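The indexing patterns from this section can also be tried on a small array whose values make each position easy to read. This is a sketch on made-up data, and the name demo is illustrative.

```python
import numpy as np

# A small 3x4 stand-in array; values run 0..11 so positions are easy to read
demo = np.arange(12).reshape(3, 4)

element = demo[2, 3]        # third row, fourth column
first_rows = demo[:2, 1]    # rows 1 and 2, second column
column = demo[:, 0]         # every row, first column
row = demo[1, :]            # second row, every column

print(element)      # 11
print(first_rows)   # [1 5]
print(column)       # [0 4 8]
print(row)          # [4 5 6 7]
```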
Assigning Values To NumPy Arrays
We can also use indexing to assign values to certain elements in arrays. We can do this by assigning directly to the indexed value:
# assign the value of 10 to the 2nd row and 6th column
print('Before', wines[1, 4:7])
wines[1, 5] = 10
print('After', wines[1, 4:7])
Before [ 0.1 25. 67. ]
After [ 0.1 10. 67. ]
We can do the same for slices. To overwrite an entire column, we can do this:
# Overwrites all the values in the eleventh column with 50.
print('Before', wines[:, 9:12])
wines[:, 10] = 50
print('After', wines[:, 9:12])
Before [[ 0.56 9.4 5. ]
[ 0.68 9.8 5. ]
[ 0.58 9.8 6. ]
...
[ 0.75 11. 6. ]
[ 0.71 10.2 5. ]
[ 0.66 11. 6. ]]
After [[ 0.56 50. 5. ]
[ 0.68 50. 5. ]
[ 0.58 50. 6. ]
...
[ 0.75 50. 6. ]
[ 0.71 50. 5. ]
[ 0.66 50. 6. ]]
1-Dimensional NumPy Arrays
So far, we’ve worked with 2-dimensional arrays, such as wines. However, NumPy is a package for working with multidimensional arrays.
One of the most common types of multidimensional arrays is the 1-dimensional array, or vector. As you may have noticed above, when we sliced wines, we retrieved a 1-dimensional array.
- A 1-dimensional array only needs a single index to retrieve an element.
- Each row and column in a 2-dimensional array is a 1-dimensional array.
Just like a list of lists is analogous to a 2-dimensional array, a single list is analogous to a 1-dimensional array.
If we slice wines and retrieve only the row at index 3 (the fourth row), we get a 1-dimensional array:
third_wine = wines[3,:]
third_wine
array([ 7.4 , 0.7 , 0. , 1.9 , 0.08, 11. , 34. , 1. , 3.51,
0.56, 50. , 5. ])
We can retrieve individual elements from third_wine using a single index.
# display the second item in third_wine
third_wine[1]
0.7
Most NumPy functions that we’ve worked with, such as numpy.random.rand, can be used with multidimensional arrays. Here’s how we’d use numpy.random.rand to generate a random vector:
np.random.rand(3)
array([0.04, 0.38, 0.08])
Previously, when we called np.random.rand, we passed in a shape for a 2-dimensional array, so the result was a 2-dimensional array. This time, we passed in a shape for a single dimensional array. The shape specifies the number of dimensions, and the size of the array in each dimension.
A shape of (10,10) will be a 2-dimensional array with 10 rows and 10 columns. A shape of (10,) will be a 1-dimensional array with 10 elements.
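A quick way to check these two shapes is to build the arrays and inspect their ndim and shape attributes. The sketch below uses numpy.zeros for the data, and the names matrix and vector are illustrative.

```python
import numpy as np

matrix = np.zeros((10, 10))   # shape for a 2-dimensional array
vector = np.zeros((10,))      # shape for a 1-dimensional array

print(matrix.ndim, matrix.shape)   # 2 (10, 10)
print(vector.ndim, vector.shape)   # 1 (10,)

# Pulling a single row out of the matrix drops one dimension
print(matrix[0, :].shape)          # (10,)
```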
Where NumPy gets more complex is when we start to deal with arrays that have more than 2 dimensions.
N-Dimensional NumPy Arrays
This doesn’t happen extremely often, but there are cases when you’ll want to deal with arrays that have more than 2 dimensions. One way to think of this is as a list of lists of lists. Let’s say we want to store the monthly earnings of a store, but we want to be able to quickly look up the results for a quarter, and for a year. The earnings for one year might look like this:
[500, 505, 490, 810, 450, 678, 234, 897, 430, 560, 1023, 640]
The store earned $500 in January, $505 in February, and so on. We can split up these earnings by quarter into a list of lists:
year_one = [
[500,505,490], # 1st quarter
[810,450,678], # 2nd quarter
[234,897,430], # 3rd quarter
[560,1023,640] # 4th quarter
]
We can retrieve the earnings from January by calling year_one[0][0]. If we want the results for a whole quarter, we can call year_one[0] or year_one[1].
We now have a 2-dimensional array, or matrix. But what if we now want to add the results from another year? We have to add a third dimension:
earnings = [
[ # year 1
[500,505,490], # year 1, 1st quarter
[810,450,678], # year 1, 2nd quarter
[234,897,430], # year 1, 3rd quarter
[560,1023,640] # year 1, 4th quarter
],
[ # year 2
[600,605,490], # year 2, 1st quarter
[345,900,1000],# year 2, 2nd quarter
[780,730,710], # year 2, 3rd quarter
[670,540,324] # year 2, 4th quarter
]
]
We can retrieve the earnings from January of the first year by calling earnings[0][0][0].
We now need three indexes to retrieve a single element. A three-dimensional array in NumPy is much the same. In fact, we can convert earnings to an array and then get the earnings for January of the first year:
earnings = np.array(earnings)
# year 1, 1st quarter, 1st month (January)
earnings[0,0,0]
500
# year 2, 3rd quarter, 1st month (July)
earnings[1,2,0]
780
# we can also find the shape of the array
earnings.shape
(2, 4, 3)
Indexing and slicing work the exact same way with a 3-dimensional array, but now we have an extra axis to pass in. If we wanted to get the earnings for January of all years, we could do this:
# all years, 1st quarter, 1st month (January)
earnings[:,0,0]
array([500, 600])
If we wanted to get first quarter earnings from both years, we could do this:
# all years, 1st quarter, all months (January, February, March)
earnings[:,0,:]
array([[500, 505, 490],
[600, 605, 490]])
Adding more dimensions can make it much easier to query your data if it’s organized in a certain way. As we go from 3-dimensional arrays to 4-dimensional and larger arrays, the same properties apply, and they can be indexed and sliced in the same ways.
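As a further sketch of querying along different axes, the same earnings figures can be sliced by year, by quarter, or by month. The names second_year, fourth_quarters, and decembers are illustrative.

```python
import numpy as np

# Same earnings as above: 2 years x 4 quarters x 3 months
earnings = np.array([
    [[500, 505, 490], [810, 450, 678], [234, 897, 430], [560, 1023, 640]],
    [[600, 605, 490], [345, 900, 1000], [780, 730, 710], [670, 540, 324]],
])

second_year = earnings[1]            # every quarter and month of year 2
fourth_quarters = earnings[:, 3, :]  # 4th quarter of every year
decembers = earnings[:, 3, 2]        # last month of the 4th quarter, every year

print(second_year.shape)   # (4, 3)
print(fourth_quarters)     # [[ 560 1023  640]
                           #  [ 670  540  324]]
print(decembers)           # [640 324]
```

Leaving out trailing indexes, as in earnings[1], selects everything along the remaining axes.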
NumPy Data Types
As we mentioned earlier, each NumPy array can store elements of a single data type. For example, wines contains only float values.
NumPy stores values using its own data types, which are distinct from Python types like float and str.
This is because the core of NumPy is written in a programming language called C, which stores data differently than the Python data types. NumPy data types map between Python and C, allowing us to use NumPy arrays without any conversion hitches.
You can find the data type of a NumPy array by accessing the dtype property:
wines.dtype
dtype('float64')
NumPy has several different data types, which mostly map to Python data types, like float and str. You can find a full listing of NumPy data types here, but here are a few important ones:
- float – numeric floating point data.
- int – integer data.
- string – character data.
- object – Python objects.
Data types additionally end with a suffix that indicates how many bits of memory they take up. So int32 is a 32 bit integer data type, and float64 is a 64 bit float data type.
Converting Data Types
You can use the numpy.ndarray.astype method to convert an array to a different type. The method will actually copy the array, and return a new array with the specified data type.
For instance, we can convert wines to the int data type:
# convert wines to the int data type
wines.astype(int)
array([[ 7, 0, 0, ..., 0, 50, 5],
[ 7, 0, 0, ..., 0, 50, 5],
[11, 0, 0, ..., 0, 50, 6],
...,
[ 6, 0, 0, ..., 0, 50, 6],
[ 5, 0, 0, ..., 0, 50, 5],
[ 6, 0, 0, ..., 0, 50, 6]])
As you can see above, all of the items in the resulting array are integers. Note that we used the Python int type instead of a NumPy data type when converting wines. This is because several Python data types, including float, int, and str, can be used with NumPy, and are automatically converted to NumPy data types.
We can check the name property of the dtype of the resulting array to see what data type NumPy mapped the resulting array to:
# convert to int
int_wines = wines.astype(int)
# check the data type
int_wines.dtype.name
'int64'
The array has been converted to a 64-bit integer data type. This allows for very long integer values, but takes up more space in memory than storing the values as 32-bit integers.
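The memory difference can be checked directly with the itemsize and nbytes attributes. The sketch below uses four made-up values rather than the full wines array, and the names values, as_int64, and as_int32 are illustrative.

```python
import numpy as np

values = np.array([7.8, 0.88, 50.0, 5.0])

as_int64 = values.astype(np.int64)
as_int32 = values.astype(np.int32)

# astype drops the fractional part rather than rounding
print(as_int64)                             # [ 7  0 50  5]

# itemsize is bytes per element; nbytes is the total for the array
print(as_int64.itemsize, as_int64.nbytes)   # 8 32
print(as_int32.itemsize, as_int32.nbytes)   # 4 16

# astype returned copies, so the original array is unchanged
print(values.dtype)                         # float64
```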
If you want more control over how the array is stored in memory, you can directly create NumPy dtype objects like numpy.int32:
np.int32
numpy.int32
You can use these directly to convert between types:
# convert to a 64-bit integer
wines.astype(np.int64)
array([[ 7, 0, 0, ..., 0, 50, 5],
[ 7, 0, 0, ..., 0, 50, 5],
[11, 0, 0, ..., 0, 50, 6],
...,
[ 6, 0, 0, ..., 0, 50, 6],
[ 5, 0, 0, ..., 0, 50, 5],
[ 6, 0, 0, ..., 0, 50, 6]])
# convert to a 32-bit integer
wines.astype(np.int32)
array([[ 7, 0, 0, ..., 0, 50, 5],
[ 7, 0, 0, ..., 0, 50, 5],
[11, 0, 0, ..., 0, 50, 6],
...,
[ 6, 0, 0, ..., 0, 50, 6],
[ 5, 0, 0, ..., 0, 50, 5],
[ 6, 0, 0, ..., 0, 50, 6]], dtype=int32)
# convert to a 16-bit integer
wines.astype(np.int16)
array([[ 7, 0, 0, ..., 0, 50, 5],
[ 7, 0, 0, ..., 0, 50, 5],
[11, 0, 0, ..., 0, 50, 6],
...,
[ 6, 0, 0, ..., 0, 50, 6],
[ 5, 0, 0, ..., 0, 50, 5],
[ 6, 0, 0, ..., 0, 50, 6]], dtype=int16)
# convert to an 8-bit integer
wines.astype(np.int8)
array([[ 7, 0, 0, ..., 0, 50, 5],
[ 7, 0, 0, ..., 0, 50, 5],
[11, 0, 0, ..., 0, 50, 6],
...,
[ 6, 0, 0, ..., 0, 50, 6],
[ 5, 0, 0, ..., 0, 50, 5],
[ 6, 0, 0, ..., 0, 50, 6]], dtype=int8)
NumPy Array Operations
NumPy makes it simple to perform mathematical operations on arrays. This is one of the primary advantages of NumPy, and makes it quite easy to do computations.
Single Array Math
If you do any of the basic mathematical operations /, *, -, +, ** (where ** is exponentiation) with an array and a value, it will apply the operation to each of the elements in the array.
Let’s say we want to add 10 points to each quality score because we’re feeling generous. Here’s how we’d do that:
# add 10 points to the quality score
wines[:,-1] + 10
array([15., 15., 16., ..., 16., 15., 16.])
Note that the above operation won’t change the wines array – it will return a new 1-dimensional array where 10 has been added to each element in the quality column of wines.
If we instead did +=, we’d modify the array in place:
print('Before', wines[:,11])
# modify the data in place
wines[:,11] += 10
print('After', wines[:,11])
Before [5. 5. 6. ... 6. 5. 6.]
After [15. 15. 16. ... 16. 15. 16.]
All the other operations work the same way. For example, if we want to multiply each quality score by 2, we could do it like this:
# multiply the quality score by 2
wines[:,11] * 2
array([30., 30., 32., ..., 32., 30., 32.])
Multiple Array Math
It’s also possible to do mathematical operations between arrays. This will apply the operation to pairs of elements. For example, if we add the quality column to itself, here’s what we get:
# add the quality column to itself
wines[:,11] + wines[:,11]
array([30., 30., 32., ..., 32., 30., 32.])
Note that this is equivalent to wines[:,11] * 2 – this is because NumPy adds each pair of elements. The first element in the first array is added to the first element in the second array, the second to the second, and so on.
# multiply the quality column by 2
wines[:,11] * 2
array([30., 30., 32., ..., 32., 30., 32.])
We can also use this to multiply arrays. Let’s say we want to pick a wine that maximizes alcohol content and quality. We’d multiply alcohol by quality, and select the wine with the highest score:
# multiply alcohol content by quality
alcohol_by_quality = wines[:,10] * wines[:,11]
print(alcohol_by_quality)
[750. 750. 800. ... 800. 750. 800.]
alcohol_by_quality.sort()
print(alcohol_by_quality, alcohol_by_quality[-1])
[650. 650. 650. ... 900. 900. 900.] 900.0
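Sorting in place gives the highest score but loses track of which wine it belongs to. One possible alternative is the argmax method, which returns the index of the largest value. The sketch below uses made-up alcohol and quality values, and the names scores and best_index are illustrative.

```python
import numpy as np

# Illustrative alcohol and quality columns for four wines
alcohol = np.array([9.4, 9.8, 12.5, 11.0])
quality = np.array([5.0, 5.0, 7.0, 8.0])

scores = alcohol * quality      # elementwise product
best_index = scores.argmax()    # position of the highest score

print(best_index)                                 # 3
print(alcohol[best_index], quality[best_index])   # 11.0 8.0
```

The index can then be used to pull the full row for that wine out of the original array.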
All of the common operations /, *, -, +, ** will work between arrays.
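Here is a small sketch of each operator applied between two arrays of the same shape. Note that Python spells exponentiation as **. The ^ operator is bitwise XOR and is not defined for float arrays. The names a and b are illustrative.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])

print(a + b)    # [3. 4. 5.]
print(a - b)    # [-1.  0.  1.]
print(a * b)    # [2. 4. 6.]
print(a / b)    # [0.5 1.  1.5]
print(a ** b)   # [1. 4. 9.]
```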
NumPy Array Methods
In addition to the common mathematical operations, NumPy also has several methods that you can use for more complex calculations on arrays. An example of this is the numpy.ndarray.sum method. This finds the sum of all the elements in an array by default:
# find the sum of all rows and the quality column
total = 0
for row in wines:
    total += row[11]
print(total)
24987.0
# find the sum of all rows and the quality column
wines[:,11].sum(axis=0)
24987.0
# find the sum of the rows 1, 2, and 3 across all columns
totals = []
for i in range(3):
    total = 0
    for col in wines[i,:]:
        total += col
    totals.append(total)
print(totals)
[125.1438, 158.2548, 161.753]
# find the sum of the rows 1, 2, and 3 across all columns
wines[0:3,:].sum(axis=1)
array([125.14, 158.25, 161.75])
We can pass the axis keyword argument into the sum method to find sums over an axis.
If we call sum across the wines matrix, and pass in axis=0, we’ll find the sums over the first axis of the array. This will give us the sum of all the values in every column.
It may seem backwards that the sums over the first axis would give us the sum of each column, but one way to think about this is that the specified axis is the one “going away”.
So if we specify axis=0, we want the rows to go away, and we want to find the sums for each of the remaining axes across each row:
# sum each column for all rows
totals = [0] * len(wines[0])
for i, total in enumerate(totals):
    for row_val in wines[:,i]:
        total += row_val
    totals[i] = total
print(totals)
[13295.300000000045, 843.2250000000005, 433.2499999999982, 4057.2500000000027, 139.76699999999963, 25354.0, 74248.0, 1592.8009399999985, 5291.210000000001, 1051.7300000000007, 79900.0, 24987.0]
# sum each column for all rows
wines.sum(axis=0)
array([13295.3 , 843.23, 433.25, 4057.25, 139.77, 25354. ,
74248. , 1592.8 , 5291.21, 1051.73, 79900. , 24987. ])
We can verify that we did the sum correctly by checking the shape. The shape should be (12,), corresponding to the number of columns:
wines.sum(axis=0).shape
(12,)
If we pass in axis=1, we’ll find the sums over the second axis of the array. This will give us the sum of each row:
# sum each row for all columns
totals = [0] * len(wines)
for i, total in enumerate(totals):
    for col_val in wines[i,:]:
        total += col_val
    totals[i] = total
print(totals[0:3], '...', totals[-3:])
[125.1438, 158.2548, 161.753] ... [149.48174, 155.01547, 141.49249]
# sum each row for all columns
wines.sum(axis=1)
array([125.14, 158.25, 161.75, ..., 149.48, 155.02, 141.49])
wines.sum(axis=1).shape
(1598,)
There are several other methods that behave like the sum method, including:
- numpy.ndarray.mean – finds the mean of an array.
- numpy.ndarray.std – finds the standard deviation of an array.
- numpy.ndarray.min – finds the minimum value in an array.
- numpy.ndarray.max – finds the maximum value in an array.
You can find a full list of array methods here.
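These methods accept the same axis keyword argument as sum. The sketch below applies them to a small made-up array, and the name scores is illustrative.

```python
import numpy as np

scores = np.array([
    [5.0, 7.0],
    [6.0, 8.0],
    [7.0, 9.0],
])

print(scores.mean())         # 7.0, over every element
print(scores.mean(axis=0))   # [6. 8.], one mean per column
print(scores.min(axis=1))    # [5. 6. 7.], one minimum per row
print(scores.max())          # 9.0
print(scores.std(axis=0))    # standard deviation of each column
```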
NumPy Array Comparisons
NumPy makes it possible to test to see if rows match certain values using mathematical comparison operations like <, >, >=, <=, and ==. For example, if we want to see which wines have a quality rating higher than 5, we can do this:
# return True for all rows in the Quality column that are greater than 5
wines[:,11] > 5
array([ True, True, True, ..., True, True, True])
We get a Boolean array that tells us which of the wines have a quality rating greater than 5. We can do something similar with the other operators. For instance, we can see if any wines have a quality rating equal to 10:
# return True for all rows that have a Quality rating of 10
wines[:,11] == 10
array([False, False, False, ..., False, False, False])
Subsetting
One of the powerful things we can do with a Boolean array and a NumPy array is select only certain rows or columns in the NumPy array. For example, the below code will only select rows in wines where the quality is over 15 (remember that we added 10 to every quality score earlier):
# create a boolean array for wines with quality greater than 15
high_quality = wines[:,11] > 15
print(len(high_quality), high_quality)
1598 [False False True ... True False True]
# use boolean indexing to find high quality wines
high_quality_wines = wines[high_quality,:]
print(len(high_quality_wines), high_quality_wines)
855 [[1.12e+01 2.80e-01 5.60e-01 ... 5.80e-01 5.00e+01 1.60e+01]
[7.30e+00 6.50e-01 0.00e+00 ... 4.70e-01 5.00e+01 1.70e+01]
[7.80e+00 5.80e-01 2.00e-02 ... 5.70e-01 5.00e+01 1.70e+01]
...
[5.90e+00 5.50e-01 1.00e-01 ... 7.60e-01 5.00e+01 1.60e+01]
[6.30e+00 5.10e-01 1.30e-01 ... 7.50e-01 5.00e+01 1.60e+01]
[6.00e+00 3.10e-01 4.70e-01 ... 6.60e-01 5.00e+01 1.60e+01]]
We select only the rows where high_quality contains a True value, and all of the columns. This subsetting makes it simple to filter arrays for certain criteria.
For example, we can look for wines with a lot of alcohol and high quality. In order to specify multiple conditions, we have to place each condition in parentheses (...), and separate conditions with an ampersand &:
# create a boolean array for high alcohol content and high quality
high_alcohol_and_quality = (wines[:,11] > 7) & (wines[:,10] > 10)
print(high_alcohol_and_quality)
# use boolean indexing to select out the wines
wines[high_alcohol_and_quality,:]
[ True True True ... True True True]
array([[ 7.4 , 0.7 , 0. , ..., 0.56, 50. , 15. ],
[ 7.8 , 0.88, 0. , ..., 0.68, 50. , 15. ],
[11.2 , 0.28, 0.56, ..., 0.58, 50. , 16. ],
...,
[ 6.3 , 0.51, 0.13, ..., 0.75, 50. , 16. ],
[ 5.9 , 0.65, 0.12, ..., 0.71, 50. , 15. ],
[ 6. , 0.31, 0.47, ..., 0.66, 50. , 16. ]])
We can combine subsetting and assignment to overwrite certain values in an array:
high_alcohol_and_quality = (wines[:,10] > 10) & (wines[:,11] > 7)
wines[high_alcohol_and_quality,10:] = 20
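The effect of combining a mask with assignment is easier to see on a small array. The sketch below uses a made-up two-column table, and the names sample, mask, and selected are illustrative.

```python
import numpy as np

# Illustrative table: column 0 is alcohol, column 1 is quality
sample = np.array([
    [9.4, 5.0],
    [12.5, 8.0],
    [11.0, 6.0],
    [13.0, 8.0],
])

# Each condition needs parentheses because & binds more tightly than >
mask = (sample[:, 0] > 10) & (sample[:, 1] > 7)
print(mask)          # [False  True False  True]
print(mask.sum())    # 2, since each True counts as 1

selected = sample[mask, :]   # Boolean indexing returns a copy
sample[mask, 1] = 10         # assigning through the mask edits sample in place
print(sample[:, 1])          # [ 5. 10.  6. 10.]
print(selected[:, 1])        # [8. 8.], the earlier copy is unaffected
```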
Reshaping NumPy Arrays
We can change the shape of arrays while still preserving all of their elements. This often can make it easier to access array elements. The simplest reshaping is to flip the axes, so rows become columns, and vice versa. We can accomplish this with the numpy.transpose function:
np.transpose(wines).shape
(12, 1598)
We can use the numpy.ravel function to turn an array into a one-dimensional representation. It will essentially flatten an array into a long sequence of values:
wines.ravel()
array([ 7.4 , 0.7 , 0. , ..., 0.66, 20. , 20. ])
Here’s an example where we can see the ordering of numpy.ravel:
array_one = np.array(
[
[1, 2, 3, 4],
[5, 6, 7, 8]
]
)
array_one.ravel()
array([1, 2, 3, 4, 5, 6, 7, 8])
Finally, we can use the numpy.reshape function to reshape an array to a certain shape we specify. The below code will turn the second row of wines into a 2-dimensional array with 2 rows and 6 columns:
# print the current shape of the 2nd row and all columns
wines[1,:].shape
(12,)
# reshape the 2nd row to a 2 by 6 matrix
wines[1,:].reshape((2,6))
array([[ 7.8 , 0.88, 0. , 2.6 , 0.1 , 10. ],
[67. , 1. , 3.2 , 0.68, 20. , 20. ]])
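The sketch below gathers the reshaping tools from this section on a small made-up array. It also shows that -1 can stand in for one dimension so that NumPy infers its length. The names row, grid, and auto are illustrative.

```python
import numpy as np

row = np.arange(12)            # 1-dimensional, shape (12,)

grid = row.reshape((2, 6))     # 2 rows, 6 columns
auto = row.reshape((3, -1))    # -1 lets NumPy infer the 4 columns

print(grid.shape)              # (2, 6)
print(auto.shape)              # (3, 4)
print(grid.T.shape)            # (6, 2), same shape as np.transpose(grid)
print(np.array_equal(grid.ravel(), row))   # True, ravel flattens back

# The element count must match, otherwise reshape raises ValueError
try:
    row.reshape((5, 5))
except ValueError as err:
    print("cannot reshape:", err)
```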
Combining NumPy Arrays
With NumPy, it’s very common to combine multiple arrays into a single unified array. We can use numpy.vstack to vertically stack multiple arrays.
Think of it like the second array’s items being added as new rows to the first array. We can read in the winequality-white.csv dataset that contains information on the quality of white wines, then combine it with our existing dataset, wines, which contains information on red wines.
In the below code, we:
- Read in winequality-white.csv.
- Display the shape of white_wines.
white_wines = np.genfromtxt("winequality-white.csv", delimiter=";", skip_header=1)
white_wines.shape
(4898, 12)
As you can see, we have attributes for 4898 wines. Now that we have the white wines data, we can combine all the wine data.
In the below code, we:
- Use the vstack function to combine wines and white_wines.
- Display the shape of the result.
all_wines = np.vstack((wines, white_wines))
all_wines.shape
(6496, 12)
As you can see, the result has 6496 rows, which is the sum of the number of rows in wines and the number of rows in white_wines.
If we want to combine arrays horizontally, where the number of rows stays constant but the columns are joined, then we can use the numpy.hstack function. The arrays we combine need to have the same number of rows for this to work.
Finally, we can use numpy.concatenate as a general purpose version of hstack and vstack. If we want to concatenate two arrays, we pass them into concatenate, then specify the axis keyword argument that we want to concatenate along.
- Concatenating along the first axis is similar to vstack
- Concatenating along the second axis is similar to hstack:
x = np.concatenate((wines, white_wines), axis=0)
print(x.shape, x)
(6496, 12) [[ 7.4 0.7 0. ... 0.56 20. 20. ]
[ 7.8 0.88 0. ... 0.68 20. 20. ]
[11.2 0.28 0.56 ... 0.58 20. 20. ]
...
[ 6.5 0.24 0.19 ... 0.46 9.4 6. ]
[ 5.5 0.29 0.3 ... 0.38 12.8 7. ]
[ 6. 0.21 0.38 ... 0.32 11.8 6. ]]
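The relationship between vstack, hstack, and concatenate can be checked on small stand-in arrays. In the sketch below, red, white, and extra are made-up arrays, not the real datasets.

```python
import numpy as np

red = np.ones((3, 4))      # stand-in for 3 red wines, 4 attributes
white = np.zeros((2, 4))   # stand-in for 2 white wines, 4 attributes

stacked = np.vstack((red, white))
same = np.concatenate((red, white), axis=0)
print(stacked.shape)                    # (5, 4)
print(np.array_equal(stacked, same))    # True

extra = np.full((3, 1), 9.0)            # one additional column for the red wines
wider = np.hstack((red, extra))
print(wider.shape)                      # (3, 5)
print(np.array_equal(wider, np.concatenate((red, extra), axis=1)))   # True
```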
Broadcasting
Elementwise operations normally require the arrays that you’re operating on to be the exact same shape. When the shapes differ, NumPy performs broadcasting to try to match up elements. Essentially, broadcasting involves a few steps:
- The last dimension of each array is compared.
- If the dimension lengths are equal, or one of the dimensions is of length 1, then we keep going.
- If the dimension lengths aren’t equal, and none of the dimensions have length 1, then there’s an error.
- Continue checking dimensions until the shortest array is out of dimensions.
For example, the following two shapes are compatible:
A: (50,3)
B: (3,)
This is because the length of the trailing dimension of array A is 3, and the length of the trailing dimension of array B is 3. They’re equal, so that dimension is okay. Array B is then out of dimensions, so we’re okay, and the arrays are compatible for mathematical operations.
The following two shapes are also compatible:
A: (1,2)
B: (50,2)
The last dimension matches, and A is of length 1 in the first dimension.
These two arrays don’t match:
A: (50,50)
B: (49,49)
The lengths of the dimensions aren’t equal, and neither array has either dimension length equal to 1.
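The three shape pairs above can be checked without building any arrays. The sketch below assumes NumPy 1.20 or newer, where numpy.broadcast_shapes is available. The name compatible is illustrative.

```python
import numpy as np

# np.broadcast_shapes (NumPy 1.20 or newer) reports the combined shape
print(np.broadcast_shapes((50, 3), (3,)))     # (50, 3)
print(np.broadcast_shapes((1, 2), (50, 2)))   # (50, 2)

# Incompatible shapes raise ValueError
try:
    np.broadcast_shapes((50, 50), (49, 49))
    compatible = True
except ValueError:
    compatible = False
print(compatible)   # False
```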
There’s a detailed explanation of broadcasting here, but we’ll go through a few examples to illustrate the principle:
wines * np.array([1,2])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/jd/pq0swyt521jb2424d6fvth840000gn/T/ipykernel_13638/2228080465.py in <module>
----> 1 wines * np.array([1,2])
ValueError: operands could not be broadcast together with shapes (1598,12) (2,)
The above example didn’t work because the two arrays don’t have a matching trailing dimension. Here’s an example where the last dimension does match:
array_one = np.array(
[
[1,2],
[3,4]
]
)
array_two = np.array([4,5])
array_one + array_two
array([[5, 7],
[7, 9]])
As you can see, array_two has been broadcast across each row of array_one. Here’s an example with our wines data:
rand_array = np.random.rand(12)
wines + rand_array
array([[ 7.57, 1.1 , 0.66, ..., 0.59, 20.33, 20.78],
[ 7.97, 1.28, 0.66, ..., 0.71, 20.33, 20.78],
[11.37, 0.68, 1.22, ..., 0.61, 20.33, 20.78],
...,
[ 6.47, 0.91, 0.79, ..., 0.78, 20.33, 20.78],
[ 6.07, 1.04, 0.78, ..., 0.74, 20.33, 20.78],
[ 6.17, 0.71, 1.13, ..., 0.69, 20.33, 20.78]])
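A common practical use of this pattern is subtracting each column’s mean from every row. The sketch below uses a small made-up array instead of wines, and the names data, column_means, and centered are illustrative.

```python
import numpy as np

data = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
])

column_means = data.mean(axis=0)   # shape (2,)
centered = data - column_means     # (3, 2) minus (2,) broadcasts over the rows

print(column_means)                # [ 2. 20.]
print(centered)                    # each column now has a mean of zero
print(centered.mean(axis=0))       # [0. 0.]
```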