Reputation: 13768
Let's say I have a large dataframe, large, that has a MultiIndex on the rows. I pare down this dataframe by selecting only some of the rows and assign the result to small. In particular, small has fewer distinct values in the 0th level of its row MultiIndex than large.
I then want a list of the distinct values in the 0th level of the row MultiIndex of small, so I call small.index.levels[0]. The result is strange: it returns the same thing as large.index.levels[0], despite the fact that there should be fewer values.
What's going on?
MWE:
import pandas as pd
import numpy as np
np.random.seed(0)
idx = pd.MultiIndex.from_product([['John', 'Josh', 'Alex'], list('abcde')],
                                 names=['Person', 'Letter'])
large = pd.DataFrame(data=np.random.randn(15, 2),
                     index=idx,
                     columns=['one', 'two'])
# Keep only the rows whose 'Person' value starts with 'Jo'
small = large.loc[['Jo'==d[0:2] for d in large.index.get_level_values('Person')]]
print(small.index.levels[0])
print(large.index.levels[0])
Output:
Index([u'Alex', u'John', u'Josh'], dtype='object')
Index([u'Alex', u'John', u'Josh'], dtype='object')
Expected output:
Index([u'John', u'Josh'], dtype='object')
Index([u'Alex', u'John', u'Josh'], dtype='object')
Upvotes: 4
Views: 364
Reputation: 336
I found this question after having the same problem and posted it as a bug on the pandas issue tracker. I was told it's expected behaviour: pandas only updates the codes when slicing a MultiIndex, not the levels. You can use MultiIndex.remove_unused_levels() to drop the level values that are no longer used in the slice.
So in your example:
small = large.loc[['Jo'==d[0:2] for d in large.index.get_level_values('Person')]]
small.index = small.index.remove_unused_levels()
print(small.index.levels[0])
print(large.index.levels[0])
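To make the codes-versus-levels distinction concrete, here is a minimal sketch that rebuilds the frame from the question and inspects both attributes. It assumes a pandas version that exposes MultiIndex.codes (older releases called this attribute labels); the variable names stale_levels, used_codes and pruned_levels are only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
np.random.seed(0)
idx = pd.MultiIndex.from_product(
    [["John", "Josh", "Alex"], list("abcde")], names=["Person", "Letter"]
)
large = pd.DataFrame(np.random.randn(15, 2), index=idx, columns=["one", "two"])

# Keep only the people whose name starts with "Jo".
mask = np.asarray(large.index.get_level_values("Person").str.startswith("Jo"), dtype=bool)
small = large[mask]

# The levels are carried over from `large` unchanged ...
stale_levels = list(small.index.levels[0])

# ... while the codes only point at the entries that are still in use.
used_codes = np.unique(small.index.codes[0]).tolist()

# remove_unused_levels() returns a new MultiIndex with pruned levels.
pruned_levels = list(small.index.remove_unused_levels().levels[0])

print(stale_levels)
print(used_codes)
print(pruned_levels)
```

The codes are positions into the levels, so after slicing they simply stop referring to the position of 'Alex' while the levels themselves stay as they were.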
Upvotes: 0
Reputation: 129008
It is more efficient to select the rows with a vectorized string method:
In [43]: large[large.index.get_level_values('Person').to_series().str.startswith('Jo').values]
Out[43]:
                    one       two
Person Letter
John   a       1.764052  0.400157
       b       0.978738  2.240893
       c       1.867558 -0.977278
       d       0.950088 -0.151357
       e      -0.103219  0.410599
Josh   a       0.144044  1.454274
       b       0.761038  0.121675
       c       0.443863  0.333674
       d       1.494079 -0.205158
       e       0.313068 -0.854096
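As a quick check that the vectorized mask selects the same rows as the list comprehension in the question, the following sketch builds both and compares them. It calls .str.startswith directly on the Index returned by get_level_values, which should be equivalent to going through to_series() first; the names loop_mask and vector_mask are made up for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
np.random.seed(0)
idx = pd.MultiIndex.from_product(
    [["John", "Josh", "Alex"], list("abcde")], names=["Person", "Letter"]
)
large = pd.DataFrame(np.random.randn(15, 2), index=idx, columns=["one", "two"])

people = large.index.get_level_values("Person")

# Mask from the question: a Python-level loop over every row label.
loop_mask = np.array([p[0:2] == "Jo" for p in people])

# Vectorized alternative using the Index string accessor.
vector_mask = np.asarray(people.str.startswith("Jo"), dtype=bool)

same_mask = bool((loop_mask == vector_mask).all())
small = large[vector_mask]

print(same_mask)
print(len(small))
```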
To answer your question: that is an implementation detail. Use .get_level_values() rather than accessing the internal .levels attribute. You can do this if you want:
In [13]: small.index.get_level_values('Person').unique()
Out[13]: array(['John', 'Josh'], dtype=object)
In [14]: large.index.get_level_values('Person').unique()
Out[14]: array(['John', 'Josh', 'Alex'], dtype=object)
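For reference, the same lookup can be sketched as a standalone script. It converts the result with tolist() so the output does not depend on whether unique() returns an array or an Index in the installed pandas version; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
np.random.seed(0)
idx = pd.MultiIndex.from_product(
    [["John", "Josh", "Alex"], list("abcde")], names=["Person", "Letter"]
)
large = pd.DataFrame(np.random.randn(15, 2), index=idx, columns=["one", "two"])

mask = np.asarray(large.index.get_level_values("Person").str.startswith("Jo"), dtype=bool)
small = large[mask]

# Distinct values actually present, in order of first appearance.
small_people = small.index.get_level_values("Person").unique().tolist()
large_people = large.index.get_level_values("Person").unique().tolist()

print(small_people)
print(large_people)
```

Unlike .levels, which is stored sorted and may contain unused entries, this reflects only the values present in the index, in the order they first appear.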
Upvotes: 1