Reputation: 11
I'm interested in configuring or patching pandas so that its memory overhead is as low as possible. In an experiment I created two numpy arrays, each containing 50 million uint32 values. Storing these arrays in plain numpy format takes 200 + 200 = 400 MB. If I wrap one of the arrays in a Series object (with index=None), it consumes ~600 MB of memory. If I wrap the two arrays in a DataFrame object (with index=None), the memory requirement grows to ~1600 MB.
It seems that the extra memory requirement is #rows * 8 bytes for Series storage, and #rows * (#columns + 1) * 8 bytes for DataFrame storage. Can you explain exactly what extra data is stored in the Series and DataFrame objects alongside the original numpy arrays?
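For reference, here is roughly how I measure it (a minimal sketch; the column names `a`/`b` are arbitrary, and `memory_usage` assumes a reasonably recent pandas version, so the exact numbers may differ from mine):

```python
import numpy as np
import pandas as pd

N = 50_000_000  # 50 million rows

# Two raw numpy arrays: 4 bytes per uint32 value -> ~200 MB each
a = np.zeros(N, dtype=np.uint32)
b = np.zeros(N, dtype=np.uint32)
print(a.nbytes + b.nbytes)          # ~400_000_000 bytes for the raw data

# Wrapping one array in a Series adds extra structures on top of the data
s = pd.Series(a)
print(s.memory_usage(index=True))   # data + whatever the Series stores besides it

# Wrapping both arrays in a DataFrame
df = pd.DataFrame({"a": a, "b": b})
print(df.memory_usage(index=True))  # per-column data + index, as reported by pandas
```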
Upvotes: 1
Views: 222
Reputation: 105481
The extra storage is due to the row index, which is stored as one 64-bit integer per row. There is an open issue that tracks reducing this memory usage for your use case: https://github.com/pydata/pandas/issues/939
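You can isolate the index cost yourself with something like the sketch below (assuming a pandas version recent enough to have `memory_usage`; note that newer pandas defaults to a lazy RangeIndex, so the first number may be close to zero there):

```python
import numpy as np
import pandas as pd

a = np.zeros(50_000_000, dtype=np.uint32)   # ~200 MB of raw data
s = pd.Series(a)

# Bytes spent on the index alone: total usage minus the data itself.
print(s.memory_usage(index=True) - s.memory_usage(index=False))

# Forcing an explicit integer index reproduces the ~8 bytes/row overhead,
# i.e. ~400 MB for 50 million rows.
s2 = pd.Series(a, index=np.arange(len(a), dtype=np.int64))
print(s2.memory_usage(index=True) - s2.memory_usage(index=False))
```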
Upvotes: 1