Sungmin Son

Reputation: 65

Subset multiindex dataframe keeps original index value

I found that subsetting a multi-index dataframe leaves the original index values behind. Here is sample code to test this.

import pandas as pd

level_one = ["foo", "bar", "baz"]
level_two = ["a", "b", "c"]
df_index = pd.MultiIndex.from_product((level_one, level_two))
df = pd.DataFrame(range(9), index=df_index, columns=["number"])
df

The code above displays the following dataframe.

       number
foo a       0
    b       1
    c       2
bar a       3
    b       4
    c       5
baz a       6
    b       7
    c       8

The code below subsets the dataframe so that it contains only 'a' and 'b' in index level 1.

df_subset = df.query("(number%3) <=1")
df_subset
       number
foo a       0
    b       1
bar a       3
    b       4
baz a       6
    b       7

The dataframe itself is the expected result, BUT its index levels still contain the original level values, which is NOT expected.

# The following code still returns the level 'c'
df_subset.index.levels[1]
# Result
Index(['a', 'b', 'c'], dtype='object')

My first question is: how can I remove the 'original' index level values after subsetting? My second question is: is this expected behavior for pandas?

Thanks

Upvotes: 1

Views: 114

Answers (2)

Adrien Riaux

Reputation: 533

It is normal for the "original" levels to remain after subsetting; this is how pandas behaves. According to the documentation: "The MultiIndex keeps all the defined levels of an index, even if they are not actually used. This is done to avoid a recomputation of the levels in order to make slicing highly performant."

You can see that the index levels are stored in a FrozenList:

[I]: df_subset.index.levels
[O]: FrozenList([['bar', 'baz', 'foo'], ['a', 'b', 'c']])

If you want to see only the levels that are actually used, you can use the get_level_values() or unique() methods. Here are some examples:

[I]: df_subset.index.get_level_values(level=1)
[O]: Index(['a', 'b', 'a', 'b', 'a', 'b'], dtype='object')

[I]: df_subset.index.unique(level=1)
[O]: Index(['a', 'b'], dtype='object')
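For anyone who wants to reproduce the difference in one go, here is a minimal self-contained sketch. It rebuilds the dataframe from the question and compares the stored levels with the values that are actually in use. The variable names `stored_levels` and `used_values` are made up for illustration, and a boolean mask is used in place of the question's `query` call as an assumed equivalent.

```python
import pandas as pd

# Rebuild the dataframe from the question.
level_one = ["foo", "bar", "baz"]
level_two = ["a", "b", "c"]
df = pd.DataFrame(
    {"number": range(9)},
    index=pd.MultiIndex.from_product((level_one, level_two)),
)

# Boolean-mask filter, assumed equivalent to df.query("(number%3) <=1").
df_subset = df[df["number"] % 3 <= 1]

# Levels are stored on the MultiIndex and survive the filtering.
stored_levels = list(df_subset.index.levels[1])

# unique(level=1) only looks at the values still present in the index.
used_values = list(df_subset.index.unique(level=1))

print(stored_levels)  # ['a', 'b', 'c']
print(used_values)    # ['a', 'b']
```

So `levels` describes what the index could contain, while `unique(level=...)` describes what it currently contains.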

Hope it can help you!

Upvotes: 2

mozway

Reputation: 261015

Yes, this is expected. Keeping the unused levels lets you still access them after filtering. You can remove them with remove_unused_levels:

df_subset.index = df_subset.index.remove_unused_levels()

print(df_subset.index.levels[1])

Output:

Index(['a', 'b'], dtype='object')
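As a rough end-to-end sketch of the same idea: `remove_unused_levels` returns a new MultiIndex rather than modifying the existing one, which is why the result has to be assigned back. The names `pruned`, `levels_before`, `levels_after` and `same_labels` are hypothetical, and the boolean mask stands in for the question's `query` call.

```python
import pandas as pd

# Rebuild the dataframe from the question.
df = pd.DataFrame(
    {"number": range(9)},
    index=pd.MultiIndex.from_product((["foo", "bar", "baz"], ["a", "b", "c"])),
)

# Filter with a boolean mask; .copy() gives an independent dataframe to modify.
df_subset = df[df["number"] % 3 <= 1].copy()

# remove_unused_levels returns a new MultiIndex; the original is untouched.
pruned = df_subset.index.remove_unused_levels()

levels_before = list(df_subset.index.levels[1])  # still includes 'c'
levels_after = list(pruned.levels[1])            # 'c' has been dropped

# The row labels themselves are unchanged, only the stored levels differ.
same_labels = pruned.equals(df_subset.index)

# Assign the pruned index back to make the change stick.
df_subset.index = pruned
```

After the assignment the rows and their labels are the same as before; only the stored level metadata has shrunk.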

Upvotes: 3
