Reputation: 623
I've run into strange behaviour with a pd.MultiIndex and am trying to understand what's going on. I'm not looking for a solution so much as an explanation.
Suppose I have a MultiIndexed dataframe:
index0 = pd.Index(['a', 'b', 'c'], name='let')
index1 = pd.Index(['foo', 'bar', 'baz'], name='word')
x = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[index0, index1])
display(x)
          0  1  2
let word
a   foo   1  2  3
b   bar   4  5  6
c   baz   7  8  9
If I then take a subset of that dataframe with df.loc:
sub = ['a', 'c']
y = x.loc[sub]
display(y)
          0  1  2
let word
a   foo   1  2  3
c   baz   7  8  9
So far, so good. Now, looking at the index of the new dataframe:
display(y.index)
MultiIndex([('a', 'foo'),
('c', 'baz')],
names=['let', 'word'])
That makes sense too. But if I look at a specific level of the subset dataframe's index...
display(y.index.levels[1])
Index(['bar', 'baz', 'foo'], dtype='object', name='word')
Suddenly I have the values of the original full dataframe, not the selected subset!
Why does this happen?
Upvotes: 1
Views: 102
Reputation: 150765
I think you are confusing levels and get_level_values:
y.index.get_level_values(1)
# Index(['foo', 'baz'], dtype='object', name='word')
y.index.levels is, as Ben mentioned in his answer, just all the possible values for each level, including those left over from before the frame was truncated. Let's see another example:
# one row of data per index entry (2 * 3 = 6 rows)
df = pd.DataFrame([[0]] * 6,
                  index=pd.MultiIndex.from_product([[0, 1], [0, 1, 2]]))
So df would look like:
     0
0 0  0
  1  0
  2  0
1 0  0
  1  0
  2  0
Now what do you think we would get with df.index.levels[1]? The answer is:
Int64Index([0, 1, 2], dtype='int64')
which consists of all the possible values in the level, whereas df.index.get_level_values(1) gives:
Int64Index([0, 1, 2, 0, 1, 2], dtype='int64')
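To make the difference concrete, here is a minimal sketch of how the two relate. It assumes the usual MultiIndex layout of levels (the distinct labels) plus codes (integer positions into those labels, one per row); the names idx and sub are only illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]])

# Keep only the rows (0, 0) and (1, 2); the levels are not pruned by this.
sub = idx[[0, 5]]

# All labels the level knows about, including ones no row uses any more.
print(list(sub.levels[1]))            # [0, 1, 2]

# One integer code per row, pointing into levels[1].
print(list(sub.codes[1]))             # [0, 2]

# The labels actually present, one per row.
print(list(sub.get_level_values(1)))  # [0, 2]

# get_level_values is effectively the levels looked up by the codes.
rebuilt = sub.levels[1].take(sub.codes[1])
print(list(rebuilt))                  # [0, 2]
```

So levels behaves like a lookup table, and get_level_values is what you want when you care about the values that are really in the index.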
Upvotes: 1
Reputation: 323306
You need to call remove_unused_levels on the index, since the levels are stored like categorical data: unused values stay in the levels until you explicitly drop them.
y.index.levels[0]
Index(['a', 'b', 'c'], dtype='object', name='let')
# after removing the unused levels
y.index=y.index.remove_unused_levels()
y.index.levels[0]
Index(['a', 'c'], dtype='object', name='let')
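For completeness, a self-contained sketch using the frame from the question. It shows the stale level before the call, the pruned level after it, and, as an alternative when you only need the values, get_level_values followed by unique. The variable names stale, trimmed, fresh and present are just for illustration.

```python
import pandas as pd

index0 = pd.Index(['a', 'b', 'c'], name='let')
index1 = pd.Index(['foo', 'bar', 'baz'], name='word')
x = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[index0, index1])

y = x.loc[['a', 'c']]

# The level still remembers 'bar' even though no row uses it.
stale = list(y.index.levels[1])
print(stale)

# remove_unused_levels returns a new index; the rows themselves are unchanged.
trimmed = y.index.remove_unused_levels()
fresh = list(trimmed.levels[1])
print(fresh)

# Alternative that leaves the index alone: read the labels actually in use.
present = list(y.index.get_level_values('word').unique())
print(present)
```

Assigning the result back with y.index = trimmed, as above, is only needed if later code relies on levels reflecting the subset.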
Upvotes: 2