pandas memory usage when reindexing

Question

I wonder why pandas uses so much memory when reindexing a Series.

I create a simple dataset:

a = pd.Series(np.arange(5e7, dtype=np.double))

According to top on my Ubuntu machine, the whole session uses about 820MB.

Now if I slice this to extract the first 100 elements:

a_sliced = a[:100]

This shows no increased memory consumption.

If I instead reindex a over the same range:

a_reindexed = a.reindex(np.arange(100))

Memory consumption rises to about 1.8GB. I also tried to clean up with gc.collect(), without success.

I would like to know if this is expected and if there is a workaround to reindex large datasets without significant memory overhead.

I am using a very recent snapshot of pandas from GitHub.

HYRY · Accepted Answer

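Index uses a hash table to map labels to locations. You can inspect it through Series.index._engine.mapping. The mapping is created lazily, the first time it is needed, and reindex() is what triggers it here. If the index is monotonic (is_monotonic), you can use asof() instead, which leaves the mapping unbuilt. Note that for a label that is not in the index, asof() returns the value at the last preceding label rather than NaN: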

import numpy as np
import pandas as pd
idx = ["%07d" % x for x in range(int(2e6))]
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]

print(a.index._engine.mapping)  # None: no hash table yet
print(a.reindex(new_index))
print(a.index._engine.mapping)  # now a hash table object, built by reindex()

a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
print(a.asof(new_index))
print(a.index._engine.mapping)  # None: asof() did not build the table

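If you want more control over how labels that do not exist are handled, you can use searchsorted() and do the logic yourself. It returns insertion positions, so the missing label "000002a" maps to the next label, "0000030":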

>>> a.index[a.index.searchsorted(new_index)] 
Index([u'0000003', u'0000020', u'0000030'], dtype=object)
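
As a rough sketch of what "doing the logic yourself" could look like: the helper below (lookup_sorted is a made-up name, not a pandas function) uses searchsorted() to find candidate positions and then keeps only exact matches, filling NaN for labels that are absent, similar to what reindex() returns. It assumes the index is sorted, unique and non-empty and that the values are numeric; I have not measured its memory use, so treat that as something to check on your own data.

```python
import numpy as np
import pandas as pd


def lookup_sorted(series, labels):
    """Exact-match lookup on a sorted, unique index via binary search.

    Hypothetical helper: labels missing from the index yield NaN.
    """
    index_values = np.asarray(series.index)
    label_values = np.asarray(labels, dtype=index_values.dtype)

    # Leftmost insertion position of each label in the sorted index.
    pos = np.asarray(series.index.searchsorted(label_values))
    # A label larger than every index entry gets position len(index); clip it.
    pos = np.minimum(pos, len(index_values) - 1)

    # A label is present only if the entry at its position equals it.
    found = np.asarray(index_values[pos] == label_values, dtype=bool)

    out = np.full(len(label_values), np.nan)
    out[found] = series.to_numpy()[pos[found]]
    return pd.Series(out, index=labels)


# Small version of the example above.
idx = ["%07d" % x for x in range(100)]
a = pd.Series(np.arange(100, dtype=np.double), index=idx)
result = lookup_sorted(a, ["0000003", "0000020", "000002a"])
print(result)
```

Here "000002a" comes back as NaN because the entry at its insertion position is "0000030", which is not an exact match. Swapping the equality check for other logic (for example, stepping back one position) would give asof()-like behaviour instead.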
