Roobie Nuby

Reputation: 1439

Strange timings on missing values in Python numpy

I have a long 1-D numpy array in which about 10% of the values are missing. I want to repeatedly change its missing values (np.nan) to other values. I know of two ways to do this:

data[np.isnan(data)] = 0, or the function np.copyto(data, 0, where=np.isnan(data))

Sometimes I want to put zeros there, and other times I want to restore the nans. I thought that recomputing np.isnan repeatedly would be slow and that it would be better to save the locations of the nans, but some of the timing results of the code below are counter-intuitive.

I ran the following:

import numpy as np
import sys
print(sys.version)
print(sys.version_info)
print(f'numpy version {np.__version__}')
data = np.random.random(100000)
data[data<0.1] = 0
data[data==0] = np.nan
%timeit missing = np.isnan(data)
%timeit wheremiss = np.where(np.isnan(data))
missing = np.isnan(data)
wheremiss = np.where(np.isnan(data))
print("Use missing list store 0")
%timeit data[missing] = 0
data[data==0] = np.nan
%timeit data[wheremiss] = 0
data[data==0] = np.nan
%timeit np.copyto(data, 0, where=missing)
print("Use isnan function store 0")
data[data==0] = np.nan
%timeit data[np.isnan(data)] = 0
data[data==0] = np.nan
%timeit np.copyto(data, 0, where=np.isnan(data))
print("Use missing list store np.nan")
data[data==0] = np.nan
%timeit data[missing] = np.nan
data[data==0] = np.nan
%timeit data[wheremiss] = np.nan
data[data==0] = np.nan
%timeit np.copyto(data, np.nan, where=missing)
print("Use isnan function store np.nan")
data[data==0] = np.nan
%timeit data[np.isnan(data)] = np.nan
data[data==0] = np.nan
%timeit np.copyto(data, np.nan, where=np.isnan(data))

And I got the following output (I have taken the liberty of adding numbers to the timing lines so that I can refer to them later):

3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
numpy version 1.17.1
01. 30 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
02. 219 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use missing list store 0
03. 339 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
04. 26 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
05. 287 µs ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store 0
06. 38.5 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
07. 43.8 µs ± 4.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Use missing list store np.nan
08. 328 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
09. 24.8 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
10. 322 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store np.nan
11. 356 µs ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
12. 300 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So here is the first question: why would it take nearly 10 times longer to store np.nan than to store 0? (Compare lines 06 and 07 with lines 11 and 12.)

And the second: why does using a stored mask of the missing values take so much longer than recomputing them with the isnan function? (Compare lines 03 and 05 with lines 06 and 07.)

This is just out of curiosity. I can see that the fastest way is to use np.where to get a list of indices (because I have only 10% missing), but if I had many more missing values, things might not be so obvious.

Upvotes: 1

Views: 84

Answers (1)

Sam Mason

Reputation: 16184

Because you're not measuring what you think you are! You're mutating your data while running the test, and timeit executes the statement many times, so later iterations run on already-changed data. Once the values have been changed to 0, the next call to isnan finds nothing and the assignment is essentially a no-op; when you assign nan instead, every subsequent iteration still finds the same nans and has to do the full write.
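To illustrate, here is a minimal sketch of a fairer measurement (the names and loop counts are mine, not from the question): restore the nans outside the timed region, so every timed assignment starts from identical data, which is exactly what the %timeit loops above do not do.

import time
import numpy as np

# Sketch: reset the nans before each run, outside the timed region, so every
# iteration of the assignment sees the same data (unlike %timeit, where later
# iterations run on data mutated by earlier ones).
data = np.random.random(100000)
data[data < 0.1] = np.nan
missing = np.isnan(data)          # boolean mask of the missing positions

def time_assignment(value, repeats=1000):
    total = 0.0
    for _ in range(repeats):
        data[missing] = np.nan    # restore state; not included in the timing
        t0 = time.perf_counter()
        data[missing] = value     # the operation we actually want to measure
        total += time.perf_counter() - t0
    return total / repeats

print("store 0:     ", time_assignment(0.0))
print("store np.nan:", time_assignment(np.nan))

Measured this way, storing 0 and storing np.nan should come out much closer, because both write the same number of elements through the same mask.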

Your question about when to use np.where versus keeping the boolean mask is harder. It depends on the relative sizes of the datatypes involved (e.g. a bool is 1 byte, an int64 index is 8 bytes), the proportion of values that are selected, how well the access pattern matches the CPU/memory subsystem's optimisations (e.g. are the selected elements mostly in one block or uniformly distributed), the cost of running np.where relative to how many times its result will be reused, and other things I can't think of right now.
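As a rough illustration of the size part of that trade-off (the numbers assume numpy's default int64 index type): a boolean mask always costs one byte per element, while the index array returned by np.where costs eight bytes per selected element, so the memory break-even sits around 12.5% selected.

import numpy as np

# Sketch: memory cost of a boolean mask vs. an index array from np.where for
# different fractions of missing values. The mask is always n bytes; the index
# array grows with the number of selected elements (8 bytes each as int64).
n = 100_000
for frac in (0.01, 0.1, 0.5):
    data = np.random.random(n)
    data[data < frac] = np.nan
    mask = np.isnan(data)
    idx = np.where(mask)[0]
    print(f"{frac:5.0%} missing: mask {mask.nbytes:>7} bytes, indices {idx.nbytes:>7} bytes")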

For other users, it might be worth pointing out that RAM latency is more than 100 times that of L1 cache, so keeping memory access predictable is important for maximising cache utilisation.
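A small, hypothetical demonstration of that point: writing through the same set of indices in sorted order keeps the accesses sequential and prefetcher-friendly, while a shuffled order turns most writes on a large array into likely cache misses, so the shuffled case is typically noticeably slower.

import time
import numpy as np

# Sketch: the same fancy-index assignment with the indices sorted vs. shuffled.
# The same elements are written either way; only the memory-access pattern
# differs, which is what the cache and prefetcher care about.
rng = np.random.default_rng(0)
data = rng.random(10_000_000)
idx = rng.choice(data.size, size=data.size // 10, replace=False)
idx_sorted = np.sort(idx)
idx_shuffled = rng.permutation(idx)

for name, indices in (("sorted  ", idx_sorted), ("shuffled", idx_shuffled)):
    t0 = time.perf_counter()
    data[indices] = 0.0
    print(name, time.perf_counter() - t0)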

Upvotes: 1
