Reputation: 8773
I wonder why pandas has such a large memory usage when reindexing a Series.
I create a simple dataset:
a = pd.Series(np.arange(5e7, dtype=np.double))
According to top on my Ubuntu machine, the whole session uses about 820MB.
Now if I slice this to extract the first 100 elements:
a_sliced = a[:100]
This shows no increased memory consumption.
If I instead reindex a on the same range:
a_reindexed = a.reindex(np.arange(100))
I get a memory consumption of about 1.8GB. I also tried to clean up with gc.collect, without success.
I would like to know if this is expected and if there is a workaround to reindex large datasets without significant memory overhead.
I am using a very recent snapshot of pandas from GitHub.
Upvotes: 3
Views: 2315
Reputation: 129018
FYI: be very careful when setting copy=False. This can cause some weird effects. Copying the index is 'cheap' if your data is large relative to the index size (which looks like it is).
If you want to free the memory that is still held after reindexing, do something like this:
s = a_big_series
s2 = s.reindex(....)
Memory is still in use at this point because the underlying data may be just a view of the old data (depending on how you are slicing it; it could be a copy, but this is NumPy-dependent).
s2 = s.reindex(....).copy()
del s
This will release the memory, since s2 now owns its own data and nothing else refers to the old data.
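As a rough way to check whether a result still points at the original buffer, the sketch below uses np.shares_memory on a smaller Series. The variable names and sizes are illustrative only, and whether a given operation returns a view or a copy may differ between pandas and NumPy versions; the only part that should hold everywhere is that the explicit .copy() result owns its own data.

```python
import numpy as np
import pandas as pd

# Smaller than the 5e7 elements in the question, to keep the check cheap.
s = pd.Series(np.arange(1_000_000, dtype=np.double))

# A positional slice is typically a view onto the original buffer.
sliced = s.iloc[:100]
shares_before = np.shares_memory(sliced.to_numpy(), s.to_numpy())

# Reindex, then force an independent copy so nothing refers to the big buffer.
s2 = s.reindex(np.arange(100)).copy()
shares_after = np.shares_memory(s2.to_numpy(), s.to_numpy())

print(shares_before, shares_after)

# Once every view of the large Series is gone, its buffer can be freed.
del sliced
del s
```

Note that any remaining view (such as sliced above) keeps the whole original buffer alive, which is why both names are deleted.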
Upvotes: 2
Reputation: 97331
Index uses a hash table to map labels to locations. You can check this with Series.index._engine.mapping. This mapping is created only when it is needed. If the index is monotonic (is_monotonic), you can use asof() instead:
import numpy as np
import pandas as pd
idx = ["%07d" % x for x in range(int(2e6))]
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]
print(a.index._engine.mapping)  # None: no hash table has been built yet
print(a.reindex(new_index))
print(a.index._engine.mapping)  # <pandas.hashtable.PyObjectHashTable object at ...>
# Start again from a fresh Series so the index has no hash table
a = pd.Series(np.arange(2e6, dtype=np.double), index=idx)
print(a.asof(new_index))
print(a.index._engine.mapping)  # None: asof() did not build the hash table
If you want more control over labels that do not exist, you can use searchsorted() and do the logic yourself:
>>> a.index[a.index.searchsorted(new_index)]
Index([u'0000003', u'0000020', u'0000030'], dtype=object)
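One possible way to "do the logic yourself" is sketched below: look up positions with searchsorted(), check whether each position really holds the requested label, and fill NaN where it does not. This assumes the index is sorted and unique; the names pos, found and result are illustrative, and I have not measured how much memory this saves compared with reindex().

```python
import numpy as np
import pandas as pd

idx = ["%07d" % x for x in range(1000)]
a = pd.Series(np.arange(1000, dtype=np.double), index=idx)
new_index = ["0000003", "0000020", "000002a"]

# Binary search for the insertion point of each requested label.
pos = a.index.searchsorted(new_index)
# Labels past the end would give pos == len(a); clip so indexing stays valid.
pos = np.minimum(pos, len(a) - 1)

# A label exists only if the index entry at its insertion point matches it.
found = np.array([a.index[p] == label for p, label in zip(pos, new_index)])

# Take the matching values and use NaN for missing labels, like reindex().
values = np.where(found, a.to_numpy()[pos], np.nan)
result = pd.Series(values, index=new_index)
print(result)
```

Here "000002a" lands at the position of "0000030", the comparison fails, and that entry is filled with NaN instead of silently taking the neighbouring value.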
Upvotes: 2