Reputation: 34326
When measuring the memory consumption of np.zeros:
import psutil
import numpy as np

process = psutil.Process()
N = 10**8
start_rss = process.memory_info().rss  # RSS baseline before the allocation
a = np.zeros(N, dtype=np.float64)
print("memory for a", process.memory_info().rss - start_rss)
the result is an unexpected 8192 bytes, i.e. almost 0, while 1e8 doubles should need about 8e8 bytes.
When replacing np.zeros(N, dtype=np.float64) with np.full(N, 0.0, dtype=np.float64), the memory reported for a is 800002048 bytes.
There are similar discrepancies in running times:
import numpy as np
N = 10**8
%timeit np.zeros(N, dtype=np.float64)
# 11.8 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.full(N, 0.0, dtype=np.float64)
# 419 ms ± 7.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I.e. np.zeros is up to 40 times faster for big sizes.
I'm not sure whether these differences hold across all architectures/operating systems, but I've observed them at least on x86-64 Windows and Linux.
Which differences between np.zeros and np.full explain the different memory consumption and the different running times?
Upvotes: 7
Views: 1204
Reputation: 362587
I don't trust psutil for these memory benchmarks, and RSS (Resident Set Size) may not be the right metric in the first place.
Using the stdlib tracemalloc you can get correct-looking numbers for memory allocation; it should be approximately an 800000000-byte delta for this N and float64 dtype:
>>> import numpy as np
>>> import tracemalloc
>>> N = 10**8
>>> tracemalloc.start()
>>> tracemalloc.get_traced_memory() # current, peak
(159008, 1874350)
>>> a = np.zeros(N, dtype=np.float64)
>>> tracemalloc.get_traced_memory()
(800336637, 802014880)
For the timing differences between np.full and np.zeros, compare the man pages for malloc and calloc: np.zeros can go through an allocation routine (calloc) that returns already-zeroed pages, so no explicit fill pass is needed. See PyArray_Zeros, which calls PyArray_NewFromDescr_int, passing 1 for the zeroed argument; that function then has a special case for allocating zeros faster:
if (zeroed || PyDataType_FLAGCHK(descr, NPY_NEEDS_INIT)) {
data = npy_alloc_cache_zero(nbytes);
}
else {
data = npy_alloc_cache(nbytes);
}
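This is also why the psutil numbers in the question make it look like almost nothing was allocated: on typical systems a large calloc is satisfied with lazily-mapped zero pages, so RSS barely moves until the array is actually written to. A minimal sketch to see the effect (exact deltas depend on OS and allocator):

import psutil
import numpy as np

process = psutil.Process()
N = 10**8

start_rss = process.memory_info().rss
a = np.zeros(N, dtype=np.float64)
# The pages are mapped but not yet backed by physical memory.
print("after np.zeros:", process.memory_info().rss - start_rss)

a[:] = 1.0  # touch every page, forcing the OS to commit real memory
print("after writing: ", process.memory_info().rss - start_rss)  # ~8e8 bytes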
It looks like np.full does not have this fast path. Its performance will be similar to first doing an uninitialized allocation and then an O(n) copy:
a = np.empty(N, dtype=np.float64)
a[:] = np.float64(0.0)
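You can check that equivalence by timing np.full against the explicit empty-then-fill version; they should land close together, while np.zeros skips the fill entirely (a rough sketch; absolute timings depend on your machine):

import numpy as np
from timeit import timeit

N = 10**8

def empty_then_fill():
    a = np.empty(N, dtype=np.float64)  # uninitialized allocation
    a[:] = 0.0                         # explicit O(n) fill pass
    return a

print("zeros:", timeit(lambda: np.zeros(N, dtype=np.float64), number=3))
print("full: ", timeit(lambda: np.full(N, 0.0, dtype=np.float64), number=3))
print("fill: ", timeit(empty_then_fill, number=3))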
The numpy devs could presumably have added a fast path to np.full for the case where the fill value is zero, but why bother adding another way to do the same thing when users can just use np.zeros in the first place.
Upvotes: 3
Reputation: 632
The numpy.zeros function goes straight to the C layer of the NumPy library, while ones and full both work the same way: they are defined in Python (in numeric.py), allocating an array and then copying the desired value into it. That extra fill pass over the array is what zeros avoids.
Have a look at the source code to figure it out for yourself: https://github.com/numpy/numpy/blob/master/numpy/core/numeric.py
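For reference, here is a paraphrased sketch of what full does there (not the exact source; the real function does the fill via multiarray.copyto):

import numpy as np

# Paraphrased sketch of numpy.full from numeric.py (not the exact source):
# allocate uninitialized memory, then broadcast the fill value into it,
# which is the O(n) fill pass that np.zeros avoids.
def full_sketch(shape, fill_value, dtype=None):
    if dtype is None:
        dtype = np.asarray(fill_value).dtype
    a = np.empty(shape, dtype)
    a[...] = fill_value
    return a

print(full_sketch(5, 0.0))  # [0. 0. 0. 0. 0.]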
Upvotes: 1