ead

Reputation: 34326

Reasons for differences in memory consumption and performance of np.zeros and np.full

When measuring memory consumption of np.zeros:

import psutil
import numpy as np

process = psutil.Process()
N=10**8
start_rss = process.memory_info().rss
a = np.zeros(N, dtype=np.float64)
print("memory for a", process.memory_info().rss - start_rss)

the result is an unexpected 8192 bytes, i.e. almost 0, while 1e8 doubles should need 8e8 bytes.

When replacing np.zeros(N, dtype=np.float64) with np.full(N, 0.0, dtype=np.float64), the memory needed for a is 800002048 bytes.

There are similar discrepancies in running times:

import numpy as np
N=10**8
%timeit np.zeros(N, dtype=np.float64)
# 11.8 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.full(N, 0.0, dtype=np.float64)
# 419 ms ± 7.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is, np.zeros is up to 40 times faster for big sizes.

I'm not sure these differences exist on all architectures/operating systems, but I've observed them at least on x86-64 Windows and Linux.

Which differences between np.zeros and np.full can explain different memory consumption and different running times?

Upvotes: 7

Views: 1204

Answers (2)

wim

Reputation: 362587

I don't trust psutil for these memory benchmarks, and rss (Resident Set Size) may not be the right metric in the first place.

Using the stdlib tracemalloc module, you can get correct-looking numbers for memory allocation; the delta should be approximately 800000000 bytes for this N and float64 dtype:

>>> import numpy as np
>>> import tracemalloc
>>> N = 10**8
>>> tracemalloc.start()
>>> tracemalloc.get_traced_memory()  # current, peak
(159008, 1874350)
>>> a = np.zeros(N, dtype=np.float64)
>>> tracemalloc.get_traced_memory()
(800336637, 802014880)

For the timing differences between np.full and np.zeros, compare the man pages for malloc and calloc: np.zeros can go through an allocation routine that gets already-zeroed pages from the OS. See PyArray_Zeros, which calls PyArray_NewFromDescr_int passing 1 for the zeroed argument, which then has a special case for allocating zeros faster:

if (zeroed || PyDataType_FLAGCHK(descr, NPY_NEEDS_INIT)) {
    data = npy_alloc_cache_zero(nbytes);
}
else {
    data = npy_alloc_cache(nbytes);
}
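This zeroed-page behavior also explains the RSS numbers in the question: the kernel maps zero pages lazily, so RSS barely moves after np.zeros and only grows once the array is actually written. A minimal sketch (assuming a Linux or Windows system with lazy page commitment; exact byte counts will vary):

```python
import numpy as np
import psutil

process = psutil.Process()
N = 10**8  # 8e8 bytes of float64

start = process.memory_info().rss
a = np.zeros(N, dtype=np.float64)
after_alloc = process.memory_info().rss  # barely changed: pages not yet committed

a[:] = 1.0  # writing touches every page, forcing the kernel to commit them
after_write = process.memory_info().rss  # now roughly 8e8 bytes larger

print("delta after alloc:", after_alloc - start)
print("delta after write:", after_write - start)
```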

It looks like np.full does not have this fast path. There the performance will be similar to first doing an init and then doing an O(n) copy:

a = np.empty(N, dtype=np.float64)
a[:] = np.float64(0.0)

The numpy devs could presumably have added a fast path to np.full for the case where the fill value is zero, but why bother adding another way to do the same thing when users could just use np.zeros in the first place.
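You can check the equivalence on your own machine with a rough timing sketch (absolute numbers will differ from the question's; what matters is that np.full and empty-plus-fill land in the same ballpark, well above np.zeros):

```python
import timeit
import numpy as np

N = 10**8

t_zeros = timeit.timeit(lambda: np.zeros(N, dtype=np.float64), number=3)
t_full = timeit.timeit(lambda: np.full(N, 0.0, dtype=np.float64), number=3)

def empty_then_fill():
    # the two-step pattern np.full effectively performs
    a = np.empty(N, dtype=np.float64)
    a[:] = 0.0
    return a

t_fill = timeit.timeit(empty_then_fill, number=3)

print(f"zeros: {t_zeros:.3f}s  full: {t_full:.3f}s  empty+fill: {t_fill:.3f}s")
```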

Upvotes: 3

Laurent GRENIER

Reputation: 632

The numpy.zeros function goes straight to the C layer of the NumPy library, while ones and full both work the same way: they initialize an array and then copy the desired value into it.

So the zeros path needs no Python-level work, whereas for the others, ones and full, Python code has to run before handing off to C.

Have a look at the source code to see it for yourself: https://github.com/numpy/numpy/blob/master/numpy/core/numeric.py
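Paraphrasing what full does in that file (a sketch, not the exact source; the real function also infers the dtype from the fill value when none is given):

```python
import numpy as np

def full_sketch(shape, fill_value, dtype=None):
    # Roughly what numpy's full() does: allocate uninitialized memory,
    # then copy the fill value into every element.
    a = np.empty(shape, dtype)
    np.copyto(a, fill_value, casting='unsafe')
    return a

print(full_sketch(5, 7.0))  # behaves like np.full(5, 7.0)
```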

Upvotes: 1
