Reputation: 1061
I finally found a performance bottleneck in my code but am confused as to what the reason is. To solve it I changed all my calls of numpy.zeros_like to instead use numpy.zeros. But why is zeros_like so much slower?
For example (note the e-05 on the zeros call):
>>> timeit.timeit('np.zeros((12488, 7588, 3), np.uint8)', 'import numpy as np', number = 10)
5.2928924560546875e-05
>>> timeit.timeit('np.zeros_like(x)', 'import numpy as np; x = np.zeros((12488, 7588, 3), np.uint8)', number = 10)
1.4402990341186523
But then, strangely, writing to an array created with zeros is noticeably slower than writing to an array created with zeros_like:
>>> timeit.timeit('x[100:-100, 100:-100] = 1', 'import numpy as np; x = np.zeros((12488, 7588, 3), np.uint8)', number = 10)
0.4310588836669922
>>> timeit.timeit('x[100:-100, 100:-100] = 1', 'import numpy as np; x = np.zeros_like(np.zeros((12488, 7588, 3), np.uint8))', number = 10)
0.33325695991516113
My guess is zeros is using some CPU trick and not actually writing to the memory to allocate it. This is done on the fly when it's written to. But that still doesn't explain the massive discrepancy in array creation times.
I'm running Mac OS X Yosemite with the current numpy version:
>>> numpy.__version__
'1.9.1'
Upvotes: 29
Views: 30844
Reputation: 231375
My timings in IPython are (with its simpler timeit interface):
In [57]: timeit np.zeros_like(x)
1 loops, best of 3: 420 ms per loop
In [58]: timeit np.zeros((12488, 7588, 3), np.uint8)
100000 loops, best of 3: 15.1 µs per loop
When I look at the code with IPython (np.zeros_like??) I see:
res = empty_like(a, dtype=dtype, order=order, subok=subok)
multiarray.copyto(res, 0, casting='unsafe')
while np.zeros is a black box: pure compiled code.
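To check that those two lines account for the time, the same sequence can be reproduced by hand. A minimal sketch (zeros_like_by_hand is just a throwaway name for illustration; exact numbers are machine dependent):

import numpy as np
import timeit

x = np.zeros((12488, 7588, 3), np.uint8)

# the same two steps that np.zeros_like?? shows
def zeros_like_by_hand(a):
    res = np.empty_like(a)
    np.copyto(res, 0, casting='unsafe')
    return res

# the hand-rolled copy path should land in the same ballpark as
# np.zeros_like, while np.zeros stays in the microsecond range
print(timeit.timeit(lambda: zeros_like_by_hand(x), number=10))
print(timeit.timeit(lambda: np.zeros_like(x), number=10))
print(timeit.timeit(lambda: np.zeros(x.shape, x.dtype), number=10))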
Timings for empty are:
In [63]: timeit np.empty_like(x)
100000 loops, best of 3: 13.6 µs per loop
In [64]: timeit np.empty((12488, 7588, 3), np.uint8)
100000 loops, best of 3: 14.9 µs per loop
So the extra time in zeros_like is in that copy.
In my tests, the difference in assignment times (x[...] = 1) is negligible.
My guess is that zeros, ones, and empty are all early compiled creation functions. empty_like was added as a convenience, just drawing shape and type info from its input. zeros_like was written with more of an eye toward easy programming maintenance (reusing empty_like) than toward speed.
np.ones and np.full also use the np.empty ... copyto sequence, and show similar timings.
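A quick way to see that pattern (a sketch, not a benchmark; timings will vary by machine and numpy version):

import numpy as np
import timeit

shape, dt = (12488, 7588, 3), np.uint8

# np.zeros can take the fast allocation path; np.ones and np.full go
# through the empty + copyto sequence, so they should pattern with
# zeros_like rather than with zeros
print(timeit.timeit(lambda: np.zeros(shape, dt), number=10))
print(timeit.timeit(lambda: np.ones(shape, dt), number=10))
print(timeit.timeit(lambda: np.full(shape, 0, dt), number=10))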
https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/array_assign_scalar.c appears to be the file that copies a scalar (such as 0) to an array. I don't see a use of memset there.
https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c has calls to malloc and calloc.
https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c - source for zeros and empty. Both call PyArray_NewFromDescr_int, but one ends up using npy_alloc_cache_zero and the other npy_alloc_cache.
npy_alloc_cache in alloc.c calls alloc. npy_alloc_cache_zero calls npy_alloc_cache followed by a memset. The code in alloc.c is further complicated by a THREAD option.
More on the calloc vs. malloc+memset difference at:
Why malloc+memset is slower than calloc?
But with caching and garbage collection, I wonder whether the calloc/memset distinction applies.
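One rough way to poke at that question from Python: np.empty followed by an explicit fill is more or less the malloc+memset pattern, while np.zeros can use the calloc-style path. A sketch (empty_then_fill is just an illustrative name; numbers depend on the machine):

import numpy as np
import timeit

shape, dt = (12488, 7588, 3), np.uint8

def empty_then_fill():
    a = np.empty(shape, dt)   # allocate without touching the pages
    a.fill(0)                 # touch every page, roughly a memset
    return a

# np.zeros can leave the zeroing to the OS (calloc-like), so its
# creation time should stay tiny; the empty-then-fill version pays for
# touching all the memory up front, much like zeros_like does
print(timeit.timeit(lambda: np.zeros(shape, dt), number=10))
print(timeit.timeit(empty_then_fill, number=10))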
This simple test with the memory_profiler package supports the claim that zeros and empty allocate memory 'on the fly', while zeros_like allocates everything up front:
N = (1000, 1000)
M = (slice(None, 500, None), slice(500, None, None))
Line # Mem usage Increment Line Contents
================================================
2 17.699 MiB 0.000 MiB @profile
3 def test1(N, M):
4 17.699 MiB 0.000 MiB print(N, M)
5 17.699 MiB 0.000 MiB x = np.zeros(N) # no memory jump
6 17.699 MiB 0.000 MiB y = np.empty(N)
7 25.230 MiB 7.531 MiB z = np.zeros_like(x) # initial jump
8 29.098 MiB 3.867 MiB x[M] = 1 # jump on usage
9 32.965 MiB 3.867 MiB y[M] = 1
10 32.965 MiB 0.000 MiB z[M] = 1
11 32.965 MiB 0.000 MiB return x,y,z
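For reference, a script along these lines produces that kind of table (a sketch assuming the memory_profiler package and its profile decorator; the line numbers and exact figures will differ on your machine):

import numpy as np
from memory_profiler import profile

@profile
def test1(N, M):
    print(N, M)
    x = np.zeros(N)       # no memory jump
    y = np.empty(N)
    z = np.zeros_like(x)  # initial jump
    x[M] = 1              # jump on usage
    y[M] = 1
    z[M] = 1
    return x, y, z

if __name__ == '__main__':
    test1((1000, 1000), (slice(None, 500), slice(500, None)))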
Upvotes: 25
Reputation: 35125
Modern OSes allocate memory virtually, i.e., memory is given to a process only when it is first used. zeros obtains memory from the operating system in such a way that the OS zeroes it when it is first used. zeros_like, on the other hand, fills the allocated memory with zeros by itself. Both ways require about the same amount of work: it's just that with zeros_like the zeroing is done up front, whereas zeros ends up doing it on the fly.
Technically, in C the difference is calling calloc vs. malloc+memset.
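A rough way to see that trade-off from Python (a sketch; timings are machine dependent): creating the array with zeros is nearly free, but the very first write has to fault the pages in, whereas an array whose memory was already touched during creation (as with zeros_like) does not pay that cost again:

import numpy as np
import timeit

setup = 'import numpy as np; x = np.zeros((12488, 7588, 3), np.uint8)'
setup_like = ('import numpy as np; '
              'x = np.zeros_like(np.zeros((12488, 7588, 3), np.uint8))')

# number=1 so that only the first write is measured; that is where the
# deferred (on-the-fly) zeroing behind np.zeros shows up
print(timeit.timeit('x[:] = 1', setup, number=1))
# the zeros_like array was already written during creation, so the
# first write here should be cheaper
print(timeit.timeit('x[:] = 1', setup_like, number=1))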
Upvotes: 27