pstatix

Reputation: 3858

How does array.array() use so little memory?

I'm not sure why an array.array() object uses so much less memory than sys.getsizeof suggests:

import sys
from array import array
a = array('f')
for i in range(500000):
    a.append(float(i))
sys.getsizeof(a)
# 2100228
sum(sys.getsizeof(i) for i in a)
# 12000000 (makes sense, 24 bytes * 500K)
# 2100228 + 12000000 = 14100228
# 14100228 / 1000 = 14,100.228KB
# 14,100.228 / 1000 = 14.1MB

However, when I examine the process in Task Manager, the program's memory only increases by 3MB. So how is the process only using 3MB more when the object supposedly takes up 14.1MB?

Upvotes: 0

Views: 185

Answers (3)

abarnert

Reputation: 365975

A Python float is a fully-featured object, which knows its type (so it has methods) and can be garbage collected and so on. In CPython (the Python implementation you’re probably using), this works by storing a pointer to the type object (8 bytes) and a reference count (8 more bytes) along with the actual IEEE float64 value (8 more bytes), so it’s at least 24 bytes long.

A list just stores references to Python objects. So, a list of a half million floats will take a bit over 4MB for the list itself (storing all those references), plus all those referenced float objects will take another 12MB in total.

An array.array doesn’t store float objects; it just stores the bits of the IEEE float64 value (8 bytes) and creates float objects on the fly whenever you ask for one with, e.g., arr[0]. This makes it a lot smaller (the whole thing only takes 4MB in memory) but also slower. [1]

And of course you aren’t even storing an array of IEEE float64 (that’s d, not f), but float32. Half a million of those takes 2MB.
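The sizes discussed above can be checked with a small sketch like the following. It assumes 64-bit CPython (where a float object is 24 bytes); the variable names are made up for illustration, and the exact byte counts will vary slightly between versions.

```python
import sys
from array import array

N = 500_000  # same element count as in the question

as_list = [float(i) for i in range(N)]
as_double = array('d', as_list)  # raw 8-byte float64 values
as_single = array('f', as_list)  # raw 4-byte float32 values

# Size of the list container alone: essentially one reference per element.
list_container = sys.getsizeof(as_list)
# Size of the separate float objects those references point at.
list_payload = sum(sys.getsizeof(x) for x in as_list)

print("list container:", list_container)
print("list floats   :", list_payload)
print("array('d')    :", sys.getsizeof(as_double), "itemsize", as_double.itemsize)
print("array('f')    :", sys.getsizeof(as_single), "itemsize", as_single.itemsize)
```

For an array, sys.getsizeof already includes the element buffer, so there is nothing further to add on top of it.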

If you want the best of both worlds, the third-party library NumPy can store the bits the same way array.array does, and it can do calculations on those bits without having to create and destroy float objects all over the place, so it’s both smaller and faster.


So, when you ask for the size of an array of 500K f floats, that’s 2MB, because it stores only the 500K native IEEE float32 values (plus a few dozen bytes of fixed overhead).

But when you loop over that array, counting the size of each member, you’re actually creating 24-byte float objects on the fly. The total size of all of those temporary objects is 12MB. But they’re temporary—as soon as you check the size of each one, you forget about it, it becomes garbage and gets cleaned up, and the same 24 bytes can be reused for the next one.
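A quick way to see that the float objects are built on demand rather than stored is sketched below. The identity check reflects how CPython behaves in practice and is not a language guarantee, so treat that line as an observation rather than a rule.

```python
from array import array

a = array('f', [0.1, 2.5])

# Each indexing operation builds a new Python float from the stored 4 bytes.
x = a[0]
y = a[0]
print(type(x))  # <class 'float'>
print(x == y)   # True: both were built from the same stored value
print(x is y)   # typically False in CPython: two distinct temporary objects

# The array holds float32 data, so 0.1 comes back rounded to single precision.
print(a[0] == 0.1)  # False
print(a[1] == 2.5)  # True: 2.5 is exactly representable in float32
```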


As for why Task Manager shows your memory go up by 3MB:

Almost every program works by having a heap of memory, allocating out of that heap, and only asking the OS for more memory in large chunks when it needs more. (CPython makes this even more complicated by having two custom heaps on top of the basic one, but don’t worry about that.)

So, let’s say the interpreter has 2MB of space left over in its heap, and you ask it to allocate a 4MB object. It needs to go back to Windows and ask for at least 2MB more memory. It gets a little more than it needs (so it won’t immediately need to go back and ask for more), and that turns out to be about 3MB. Of course this is just one of many ways you could end up getting 3MB from the OS, and figuring out exactly which one happened requires complicated debugging (more complicated than doing more useful things, like just tracking the actual heap use of your program).

As you can see, this makes measuring memory usage by Task Manager pretty useless, except for very broad strokes. (And it’s actually even worse than that, once you get into questions like when Python returns free memory to Windows, what happens when memory is fragmented, whether the OS overcommits, when pages can and can’t be remapped in virtual memory, and all kinds of other complexities.)
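One way to track the program's own heap use instead of relying on Task Manager is the standard-library tracemalloc module. The following is a minimal sketch that rebuilds the array from the question; the reported numbers include any growth headroom the array keeps, so expect a little more than the raw buffer size.

```python
import tracemalloc
from array import array

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

a = array('f')
for i in range(500_000):
    a.append(float(i))

after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

grown = after - before
print("raw buffer bytes:", len(a) * a.itemsize)
print("traced growth   :", grown)
print("traced peak     :", peak)
```

This counts only memory allocated through Python's allocators while tracing is active, which is usually the number you care about when asking how big an object is.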


1. Although it’s not always slower. Sometimes being more compact in memory gives you so much advantage in caching or virtual memory that it more than makes up for the wasted time creating and destroying objects all over the place.

Upvotes: 4

user2357112

Reputation: 281683

Your a array does not actually contain any of the objects produced by for i in a. Those objects are generated on access. a contains raw 32-bit floats, not float objects.

Upvotes: 3

tif

Reputation: 1484

As written in the docs, "the array module defines an object type which can compactly represent an array of basic values: characters, integers, floating point numbers". That is, each float object you get from a[i] has to carry its own type info, while the array a as a whole stores the element type only once.

Upvotes: 0
