Reputation: 1439
I have a long 1-d numpy array in which about 10% of the values are missing. I repeatedly want to change its missing values (np.nan) to other values. I know of two ways to do this:
data[np.isnan(data)] = 0
or the function
np.copyto(data, 0, where=np.isnan(data))
Sometimes I want to put zeros there, other times I want to restore the nans.
I thought that recomputing np.isnan repeatedly would be slow, and that it would be better to save the locations of the nans once. Some of the timing results of the code below are counter-intuitive.
I ran the following:
import numpy as np
import sys
print(sys.version)
print(sys.version_info)
print(f'numpy version {np.__version__}')
data = np.random.random(100000)
data[data<0.1] = 0
data[data==0] = np.nan
%timeit missing = np.isnan(data)
%timeit wheremiss = np.where(np.isnan(data))
missing = np.isnan(data)
wheremiss = np.where(np.isnan(data))
print("Use missing list store 0")
%timeit data[missing] = 0
data[data==0] = np.nan
%timeit data[wheremiss] = 0
data[data==0] = np.nan
%timeit np.copyto(data, 0, where=missing)
print("Use isnan function store 0")
data[data==0] = np.nan
%timeit data[np.isnan(data)] = 0
data[data==0] = np.nan
%timeit np.copyto(data, 0, where=np.isnan(data))
print("Use missing list store np.nan")
data[data==0] = np.nan
%timeit data[missing] = np.nan
data[data==0] = np.nan
%timeit data[wheremiss] = np.nan
data[data==0] = np.nan
%timeit np.copyto(data, np.nan, where=missing)
print("Use isnan function store np.nan")
data[data==0] = np.nan
%timeit data[np.isnan(data)] = np.nan
data[data==0] = np.nan
%timeit np.copyto(data, np.nan, where=np.isnan(data))
And I got the following output (I have taken the liberty of adding numbers to the timing lines so that I can refer to them later):
3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
numpy version 1.17.1
01. 30 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
02. 219 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use missing list store 0
03. 339 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
04. 26 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
05. 287 µs ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store 0
06. 38.5 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
07. 43.8 µs ± 4.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Use missing list store np.nan
08. 328 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
09. 24.8 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
10. 322 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store np.nan
11. 356 µs ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
12. 300 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So here is the first question: why does it take nearly 10 times longer to store np.nan than to store 0? (Compare lines 06 and 07 vs. lines 11 and 12.)
And the second question: why does it take much longer to use a stored boolean mask of the missing values than to recompute them with np.isnan? (Compare lines 03 and 05 vs. lines 06 and 07.)
This is just out of curiosity. I can see that the fastest way is to use np.where to get a list of indices (because only 10% of my values are missing), but with many more missing values things might not be so obvious.
Upvotes: 1
Views: 84
Reputation: 16184
Because you're not measuring what you think you are! You're mutating your data while running the test, and timeit runs the statement multiple times, so later runs operate on data that earlier runs have already changed. Once the values have been set to 0, the next np.isnan call finds nothing and the assignment is essentially a no-op; assigning nan, on the other hand, creates more work for the next iteration.
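One way to sidestep the skew is to restore the nans in timeit's setup step so that every measured run starts from identical data. A minimal sketch (nan_mask is a name I've introduced; with number=1 each measurement is a single, noisier run, but it is no longer contaminated by previous ones):

import timeit
import numpy as np

# Rebuild the test data from the question: roughly 10% NaNs.
data = np.random.random(100000)
data[data < 0.1] = np.nan
nan_mask = np.isnan(data)  # saved mask of the missing positions

# setup restores the NaNs before every measurement, and number=1 means
# each measurement times exactly one assignment on identical data.
times = timeit.repeat(
    stmt="data[nan_mask] = 0",
    setup="data[nan_mask] = np.nan",
    globals={"data": data, "nan_mask": nan_mask},
    repeat=7,
    number=1,
)
print(f"best of 7 single runs: {min(times) * 1e6:.1f} us")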
Your question about when to use np.where versus keeping the array of bools is a bit more difficult. It involves the relative sizes of the datatypes (e.g. a bool is 1 byte, an int64 index is 8 bytes), the proportion of values that are selected, how well the access pattern matches the CPU/memory subsystem's optimisations (e.g. mostly in one block vs. uniformly distributed), the cost of running np.where relative to how many times the result will be reused, and other things I can't think of right now.
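As a rough illustration of the size factor alone, here is a sketch comparing the memory footprint of the two representations at a few missing fractions; on a 64-bit platform np.where yields 8-byte indices, so the index array only stays smaller than the 1-byte-per-element mask below roughly 1/8 selectivity (the fractions chosen are illustrative):

import numpy as np

# Memory footprint of the two selector representations at several
# missing fractions (sizes only; timings would still need %timeit).
n = 100000
for frac in (0.01, 0.1, 0.125, 0.5):
    data = np.random.random(n)
    data[np.random.random(n) < frac] = np.nan
    mask = np.isnan(data)    # always n bytes: one byte per bool
    idx = np.where(mask)[0]  # 8 bytes per selected element on 64-bit
    winner = "indices" if idx.nbytes < mask.nbytes else "mask"
    print(f"frac={frac:5.3f}  mask={mask.nbytes} B  "
          f"indices={idx.nbytes} B  smaller: {winner}")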
For other readers, it might be worth pointing out that RAM latency is more than 100 times that of L1 cache, so keeping memory access predictable is important for maximizing cache utilisation.
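A quick, machine-dependent sketch of that effect: gathering the same elements through sorted (predictable) versus shuffled indices. The absolute numbers will vary by machine; only the contrast between the two is the point:

import numpy as np
import timeit

# Gather the same ~10% of elements through sorted (cache-friendly) and
# shuffled (cache-hostile) index arrays.  The array is large enough
# (~80 MB) not to fit in cache, so the access pattern dominates.
n = 10_000_000
data = np.random.random(n)
idx_sorted = np.sort(np.random.choice(n, n // 10, replace=False))
idx_shuffled = np.random.permutation(idx_sorted)

for name, idx in (("sorted", idx_sorted), ("shuffled", idx_shuffled)):
    t = min(timeit.repeat(lambda: data[idx], repeat=5, number=10)) / 10
    print(f"{name:8s}: {t * 1e3:.2f} ms per gather")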
Upvotes: 1