Andrea Zonca

Reputation: 8773

pandas memory usage when reindexing

I wonder why pandas uses so much memory when reindexing a Series.

I create a simple dataset:

a = pd.Series(np.arange(5e7, dtype=np.double))

According to top on my Ubuntu machine, the whole session uses about 820 MB.

Now if I slice this to extract the first 100 elements:

a_sliced = a[:100]

This shows no increased memory consumption.

If I instead reindex a over the same range:

a_reindexed = a.reindex(np.arange(100))

Memory consumption rises to about 1.8 GB. I also tried to clean up with gc.collect(), without success.

I would like to know if this is expected and if there is a workaround to reindex large datasets without significant memory overhead.

I am using a very recent snapshot of pandas from github.

Upvotes: 3

Views: 2315

Answers (2)

Jeff

Reputation: 129018

FYI, be very careful when setting copy=False; it can cause some weird effects. Copying the index is 'cheap' if your data is large relative to the index size (which looks like it is here).

If you want to release the memory that is still held after reindexing, do something like this:

s = a_big_series
s2 = s.reindex(....)

Memory is still used because the underlying data is just a view of the old data (depending on how you are slicing it; it could be a copy, but this is numpy dependent).

s2 = s.reindex(....).copy()
del s

This will release the memory.
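As a rough sketch of the copy-then-delete pattern above, the following uses a small stand-in series (the size and the variable names are assumptions for illustration, not from the question) and np.shares_memory to check whether a result still points at the original buffer:

```python
import numpy as np
import pandas as pd

# Hypothetical small series standing in for the large one in the question.
s = pd.Series(np.arange(1000, dtype=np.double))

# A positional slice is typically a view onto the original buffer.
s_sliced = s[:100]
slice_shares = np.shares_memory(s.to_numpy(), s_sliced.to_numpy())

# An explicit copy guarantees the result owns its own data.
s2 = s.reindex(np.arange(100)).copy()
copy_shares = np.shares_memory(s.to_numpy(), s2.to_numpy())

# Dropping every reference to the original (including views of it)
# allows its buffer to be freed.
del s_sliced, s
```

Whether the process footprint reported by top actually shrinks afterwards also depends on the allocator and the operating system, so treat this as a way to verify independence of the data rather than as a memory measurement.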

Upvotes: 2

HYRY

Reputation: 97331

An Index uses a hash table to map labels to locations. You can check this by inspecting Series.index._engine.mapping. The mapping is only created when it is needed. If the index is monotonic (is_monotonic), you can use asof() instead:

import numpy as np
import pandas as pd
idx = ["%07d" % x for x in range(int(2e6))]
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]

print(a.index._engine.mapping)  # None
print(a.reindex(new_index))
print(a.index._engine.mapping)  # <pandas.hashtable.PyObjectHashTable object at ...>

a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
print(a.asof(new_index))
print(a.index._engine.mapping)  # None

If you want more control over labels that do not exist, you can use searchsorted() and do the logic yourself:

>>> a.index[a.index.searchsorted(new_index)] 
Index([u'0000003', u'0000020', u'0000030'], dtype=object)
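One possible way to "do the logic yourself" is sketched below: look up the insertion positions, keep a value only where the index entry at that position equals the requested label, and fill NaN otherwise. The smaller index size and the names pos, safe_pos, found and result are assumptions made for illustration; it also assumes the index is sorted.

```python
import numpy as np
import pandas as pd

idx = ["%07d" % x for x in range(100)]
a = pd.Series(np.arange(100, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]

# Insertion positions of the requested labels in the sorted index.
pos = a.index.searchsorted(new_index)

# Clip so a label that sorts past the end does not index out of bounds.
safe_pos = np.minimum(pos, len(a.index) - 1)

# A label exists only if the index entry at its position equals it.
found = (a.index[safe_pos].to_numpy(dtype=object)
         == np.asarray(new_index, dtype=object))

# Take the value where the label was found, NaN otherwise.
values = np.where(found, a.to_numpy()[safe_pos], np.nan)
result = pd.Series(values, index=new_index)
print(result)
```

Here "000002a" sorts between "0000029" and "0000030", so it is reported as missing instead of silently picking up a neighbouring value.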

Upvotes: 2
