TY Lim

Reputation: 623

Subset of Pandas MultiIndex works for whole index but not for specific level?

I've run into strange behaviour with a pd.MultiIndex and am trying to understand what's going on. Not looking for a solution so much as an explanation.

Suppose I have a MultiIndexed dataframe:

import pandas as pd

index0 = pd.Index(['a', 'b', 'c'], name='let')
index1 = pd.Index(['foo', 'bar', 'baz'], name='word')
x = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[index0, index1])

display(x)

        0   1   2
let word            
a   foo 1   2   3
b   bar 4   5   6
c   baz 7   8   9

If I then take a subset of that dataframe with df.loc:

sub = ['a', 'c']
y = x.loc[sub]
display(y)

        0   1   2
let word            
a   foo 1   2   3
c   baz 7   8   9

So far, so good. Now, looking at the index of the new dataframe:

display(y.index)

MultiIndex([('a', 'foo'),
            ('c', 'baz')],
           names=['let', 'word'])

That makes sense too. But if I look at a specific level of the subset dataframe's index...

display(y.index.levels[1])

Index(['bar', 'baz', 'foo'], dtype='object', name='word')

Suddenly I have the values of the original full dataframe, not the selected subset!

Why does this happen?

Upvotes: 1

Views: 102

Answers (2)

Quang Hoang

Reputation: 150765

I think you are confusing levels and get_level_values:

y.index.get_level_values(1)
# Index(['foo', 'baz'], dtype='object', name='word')

As BENY mentioned in his answer, y.index.levels holds all the possible values of each level, including the ones that were only used before the subset was taken. Let's see another example:

# Six rows of data are needed to match the six index entries.
df = pd.DataFrame([[0]] * 6,
                  index=pd.MultiIndex.from_product([[0, 1], [0, 1, 2]]))

So df would look like:

     0
0 0  0
  1  0
  2  0
1 0  0
  1  0
  2  0

Now what do you think we would get with df.index.levels[1]? The answer is:

Int64Index([0, 1, 2], dtype='int64')

which consists of all the possible values in the level. In contrast, df.index.get_level_values(1) returns the value of that level for every row:

Int64Index([0, 1, 2, 0, 1, 2], dtype='int64')
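To make the difference concrete, here is a minimal sketch (not part of the original answer) of how the two relate. It assumes the usual MultiIndex layout of levels plus integer codes; the names idx, sub, level_1 and codes_1 are only for illustration.

```python
import pandas as pd

# Same index as in the example above.
idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]])

# levels: the distinct values each level can take.
# codes: for every row, the position of its value inside the level.
level_1 = idx.levels[1]
codes_1 = idx.codes[1]

# Looking the codes up in the level gives the per-row values,
# which is what get_level_values returns.
reconstructed = level_1.take(codes_1)
per_row = idx.get_level_values(1)

# Taking a subset only changes the codes; the levels are kept as they were.
sub = idx[[0, 5]]  # keeps the rows (0, 0) and (1, 2)

print(list(level_1))
print(list(per_row))
print(list(sub.levels[1]))
print(list(sub.get_level_values(1)))
```

So levels is closer to a lookup table than to the data itself, which is why it can contain values that no row uses any more.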

Upvotes: 1

BENY

Reputation: 323306

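You need to call remove_unused_levels here. The levels of a MultiIndex behave like the categories of categorical data: each level keeps its full set of values, and taking a subset does not drop the ones that are no longer used.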

y.index.levels[0]
# Index(['a', 'b', 'c'], dtype='object', name='let')

# after removing the unused levels
y.index = y.index.remove_unused_levels()
y.index.levels[0]
# Index(['a', 'c'], dtype='object', name='let')
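For the 'word' level from the question, a short sketch along the same lines (the variable names stale, trimmed and in_rows are just for illustration). It also shows that, if you only want the values that actually occur, get_level_values followed by unique gets you there without modifying the index.

```python
import pandas as pd

# Data from the question.
index0 = pd.Index(['a', 'b', 'c'], name='let')
index1 = pd.Index(['foo', 'bar', 'baz'], name='word')
x = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[index0, index1])
y = x.loc[['a', 'c']]

# The level still carries 'bar', although no row of y uses it.
stale = list(y.index.levels[1])

# remove_unused_levels returns a new MultiIndex with the levels trimmed.
trimmed = list(y.index.remove_unused_levels().levels[1])

# Values that actually occur in y, in row order.
in_rows = list(y.index.get_level_values(1).unique())

print(stale)    # ['bar', 'baz', 'foo']
print(trimmed)  # ['baz', 'foo']
print(in_rows)  # ['foo', 'baz']
```

Note that remove_unused_levels does not change the index in place, so the result has to be assigned back as in the snippet above.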

Upvotes: 2
