Reputation: 11
I'm interested in configuring or patching pandas so that its memory overhead is as low as possible. In an experiment I created two numpy arrays, each containing 50 million uint32 values. Storing these arrays in plain numpy format takes 200 + 200 = 400 MB. If I wrap one of the arrays in a Series object (with index=None), it consumes ~600 MB of memory. If I wrap the two arrays in a DataFrame object (with index=None), the memory requirement grows to ~1600 MB.
It seems that the extra memory requirement is #rows * 8 bytes for Series storage, and #rows * (#columns + 1) * 8 bytes for DataFrame storage. Can you explain exactly what extra data is stored in the Series and DataFrame objects alongside the original numpy arrays?
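For reference, here is roughly how I measure it (a minimal sketch; the column names `a`/`b` are arbitrary, and `memory_usage` assumes a reasonably recent pandas version, so the exact numbers may differ from mine):

```python
import numpy as np
import pandas as pd

N = 50_000_000  # 50 million rows

# Two raw numpy arrays: 4 bytes per uint32 value -> ~200 MB each
a = np.zeros(N, dtype=np.uint32)
b = np.zeros(N, dtype=np.uint32)
print(a.nbytes + b.nbytes)          # ~400_000_000 bytes for the raw data

# Wrapping one array in a Series adds extra structures on top of the data
s = pd.Series(a)
print(s.memory_usage(index=True))   # data + whatever the Series stores besides it

# Wrapping both arrays in a DataFrame
df = pd.DataFrame({"a": a, "b": b})
print(df.memory_usage(index=True))  # per-column data + index, as reported by pandas
```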
Upvotes: 1
Views: 222
Reputation: 105481
The extra storage is due to the row index, which is stored as one 64-bit integer per row. There is an open issue that tracks reducing this memory usage for your use case: https://github.com/pydata/pandas/issues/939
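You can isolate the index cost yourself with something like the sketch below (assuming a pandas version recent enough to have `memory_usage`; note that newer pandas defaults to a lazy RangeIndex, so the first number may be close to zero there):

```python
import numpy as np
import pandas as pd

a = np.zeros(50_000_000, dtype=np.uint32)   # ~200 MB of raw data
s = pd.Series(a)

# Bytes spent on the index alone: total usage minus the data itself.
print(s.memory_usage(index=True) - s.memory_usage(index=False))

# Forcing an explicit integer index reproduces the ~8 bytes/row overhead,
# i.e. ~400 MB for 50 million rows.
s2 = pd.Series(a, index=np.arange(len(a), dtype=np.int64))
print(s2.memory_usage(index=True) - s2.memory_usage(index=False))
```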
Upvotes: 1