Xaume

Reputation: 281

Drop duplicated index names in MultiIndex dataframe

After performing some operation on a dataset I ended up having a data frame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5],
                  index=pd.MultiIndex.from_arrays(np.array([[0, 0, 1, 1, 1],
                                                            [1448, 1448, 1448, 1448, 1448],
                                                            [0, 0, 1, 1, 1],
                                                            [1448, 1448, 1448, 1448, 1448],
                                                            ['22', '11', '14', '15', '19']]),
                                                  names=['meas_id', 'hec_id', 'meas_id', 'hec_id', 'lab_id']))

print(df)

                                      0
meas_id hec_id meas_id hec_id lab_id   
0       1448   0       1448   22      1
                              11      2
1       1448   1       1448   14      3
                              15      4
                              19      5

As you can notice, the meas_id and hec_id levels of the MultiIndex are duplicated. I could use reset_index([2, 3], drop=True) to drop them, but the resulting data frame is actually part of a list of data frames, where some have those index levels duplicated and some do not. Therefore my question is: how can I drop duplicated index levels from a MultiIndex?
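For this specific frame, the manual fix I have in mind looks like this (the level positions [2, 3] are particular to the frame above):

df.reset_index([2, 3], drop=True)  # drops the two duplicated levels by position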

Please note that this question refers to duplicates in the index levels (the "columns" of the MultiIndex), not to duplicated rows.

Upvotes: 0

Views: 893

Answers (2)

BobVitorBob

Reputation: 68

The accepted answer solves the problem, but in my experience duplicated index levels usually appear because of a misuse of pandas' functions.

Take the following food-grading dataframe as an example:

import pandas as pd
from random import randint

# two people (10 and 20) each grade the same five foods with a random score
df = pd.DataFrame({
    'food_id': [*list(range(5)), *list(range(5))],
    'grade': [randint(0, 10) for _ in range(10)],
}, index=pd.Index([*[10 for _ in range(5)], *[20 for _ in range(5)]], name='person_id'))
print(df)

            food_id  grade
person_id                
10                0      2
10                1      5
10                2      2
10                3      6
10                4      0
20                0      4
20                1     10
20                2      2
20                3      3
20                4      1

If I wanted to find the best-graded food for each person, I could group by person_id and take the row with the largest grade.

print(df.groupby(level=0).apply(lambda g: g.nlargest(1, 'grade')))
                      food_id  grade
person_id  person_id                
10         10                3      6
20         20                1     10

but I end up with a duplicated person_id level, since apply returns each row with its original index while groupby also prepends the group key as a new level. If I set as_index=False on the groupby call, the group keys are replaced with a fresh integer index, but I still get two index levels.

print(df.groupby(level=0, as_index=False).apply(lambda g: g.nlargest(1, 'grade')))
                 food_id  grade
  person_id                
0 10               3      6
1 20               1     10

Resetting the index inside the apply yields a similar result.

print(df.groupby(level=0).apply(lambda g: g.nlargest(1, 'grade').reset_index(drop=True)))
                 food_id  grade
person_id                  
10        0        3      6
20        0        1     10

The solution here is to set group_keys=False on the groupby. When the function passed to apply returns rows that keep (part of) the original index, pandas then uses that index as-is instead of prepending the group keys as an extra level.

print(df.groupby(level=0, group_keys=False).apply(lambda g: g.nlargest(1, 'grade')))
           food_id  grade
person_id                
10               3      6
20               1     10

If food_id is part of the index, we can skip the apply entirely and use the nlargest method of SeriesGroupBy, like so:

print(df.set_index('food_id', append=True).groupby(level='person_id', group_keys=False)['grade'].nlargest(1))
person_id  food_id
10         3           6
20         1          10
Name: grade, dtype: int64
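
If a flat table is preferred, that Series can be turned back into a regular dataframe (a small sketch reusing the call above; best is a hypothetical name):

best = (df.set_index('food_id', append=True)
          .groupby(level='person_id', group_keys=False)['grade']
          .nlargest(1))
print(best.reset_index())  # columns: person_id, food_id, grade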

What I'm getting at is that duplicated index levels are (mostly) the result of a misuse of pandas' operations, and the better solution is to adjust those operations so the duplication never arises, rather than letting it happen and then patching it up.

Upvotes: 0

yatu

Reputation: 88236

For a solution that generalises to dropping duplicated index levels by name, you can do:

import numpy as np
import pandas as pd

# positions of the index levels whose name already appeared at an earlier level
ixs = np.flatnonzero(pd.Index(df.index.names).duplicated()).tolist()
df.reset_index(ixs, drop=True)

                       0
meas_id hec_id lab_id   
0       1448   22      1
               11      2
1       1448   14      3
               15      4
               19      5
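
Since the question mentions a list of data frames where only some carry duplicated levels, the same idea can be wrapped in a small helper and mapped over the whole list (a sketch; drop_dup_levels and dfs are hypothetical names):

import numpy as np
import pandas as pd

def drop_dup_levels(frame):
    # positions of index levels whose name already appeared at an earlier level
    dup = np.flatnonzero(pd.Index(frame.index.names).duplicated()).tolist()
    # frames without duplicated levels pass through unchanged
    return frame.reset_index(dup, drop=True) if dup else frame

dfs = [drop_dup_levels(f) for f in dfs]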

Upvotes: 2
