user1289813

Reputation: 11

What extra data is stored in Series and DataFrame objects?

I'm interested in configuring or patching pandas so that its memory overhead is as low as possible. In an experiment I created two NumPy arrays, each containing 50 million uint32 values. Stored as plain NumPy arrays they need 200 + 200 = 400 MB. If I wrap one of the arrays in a Series object (with index=None), it consumes ~600 MB of memory. If I wrap both arrays in a DataFrame object (with index=None), the memory requirement is ~1600 MB.

It seems that the extra memory requirement is #rows * 8 bytes for a Series and #rows * (#columns + 1) * 8 bytes for a DataFrame. Can you explain exactly what extra data is stored in the Series and DataFrame objects alongside the original NumPy arrays?
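To make the estimate concrete, here is the arithmetic behind those two formulas as a small sketch. The function names are only illustrative, and the 8 bytes per row is the overhead I inferred from the measurements, not something taken from the pandas source.

```python
ROWS = 50_000_000
VALUE_BYTES = 4      # one uint32 value
OVERHEAD_BYTES = 8   # assumed per-row overhead inferred from the measurements
MB = 10**6


def raw_bytes(rows, columns):
    """Size of the plain NumPy data: rows * columns uint32 values."""
    return rows * columns * VALUE_BYTES


def series_overhead(rows):
    """Observed extra bytes for a Series: #rows * 8."""
    return rows * OVERHEAD_BYTES


def dataframe_overhead(rows, columns):
    """Observed extra bytes for a DataFrame: #rows * (#columns + 1) * 8."""
    return rows * (columns + 1) * OVERHEAD_BYTES


# One array wrapped in a Series, and two arrays wrapped in a DataFrame.
series_total_mb = (raw_bytes(ROWS, 1) + series_overhead(ROWS)) // MB
frame_total_mb = (raw_bytes(ROWS, 2) + dataframe_overhead(ROWS, 2)) // MB

print(series_total_mb, frame_total_mb)
```

Both totals agree with the ~600 MB and ~1600 MB figures I measured.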

Upvotes: 1

Views: 222

Answers (1)

Wes McKinney

Reputation: 105481

The extra storage comes from the row index, which is stored as 64-bit integers (8 bytes per row). There is an open issue to reduce this memory usage for your use case: https://github.com/pydata/pandas/issues/939
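A rough way to see the index's share for yourself is sketched below. It assumes a pandas version that exposes `Index.nbytes`, and it passes an explicit int64 index rather than index=None, because newer pandas versions use a compact range index by default and would not show the effect. The array size and variable names are only for illustration.

```python
import numpy as np
import pandas as pd

n = 1_000_000  # smaller than 50 million so the check runs quickly

values = np.zeros(n, dtype=np.uint32)
# Explicit 64-bit integer row labels (an assumption made for this demo;
# with index=None a recent pandas would build a compact RangeIndex instead).
index = np.arange(n, dtype=np.int64)

s = pd.Series(values, index=index)

values_bytes = s.values.nbytes  # 4 bytes per uint32 value
index_bytes = s.index.nbytes    # 8 bytes per int64 label

print(values_bytes, index_bytes)
```

With these inputs the index takes twice as many bytes as the data it labels, which is the #rows * 8 bytes seen for the Series.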

Upvotes: 1
