Reputation: 281
After performing some operation on a dataset I ended up having a data frame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5],
                  index=pd.MultiIndex.from_arrays(
                      np.array([[0, 0, 1, 1, 1],
                                [1448, 1448, 1448, 1448, 1448],
                                [0, 0, 1, 1, 1],
                                [1448, 1448, 1448, 1448, 1448],
                                ['22', '11', '14', '15', '19']]),
                      names=['meas_id', 'hec_id', 'meas_id', 'hec_id', 'lab_id']))
print(df)
                                     0
meas_id hec_id meas_id hec_id lab_id
0       1448   0       1448   22     1
                              11     2
1       1448   1       1448   14     3
                              15     4
                              19     5
As you can notice, the meas_id and hec_id levels of the MultiIndex are duplicated. I could use reset_index([3, 4], drop=True) to drop them, but this data frame is actually part of a list of data frames, where some of them have those index levels duplicated and some do not. Therefore my question is: how can I drop duplicated index levels from a MultiIndex?
Please note that this question refers to duplicates in the index columns (levels), not duplicated rows.
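For reference, this is the hard-coded version I would use on this particular frame (a sketch that assumes the duplicated levels always sit at positions 3 and 4, which does not hold for every frame in the list):
# drop the MultiIndex levels at positions 3 and 4 (the duplicated
# meas_id / hec_id); only valid for frames that actually have them
print(df.reset_index([3, 4], drop=True))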
Upvotes: 0
Views: 893
Reputation: 68
The accepted answer solves the problem, but in my experience duplicated index levels usually come from misusing pandas' functions in the first place.
Take the following food-grading dataframe as an example:
import pandas as pd
from random import randint

df = pd.DataFrame({
    'food_id': [*list(range(5)), *list(range(5))],
    'grade': [randint(0, 10) for _ in range(10)],
}, index=pd.Index([*[10 for _ in range(5)], *[20 for _ in range(5)]], name='person_id'))
print(df)
           food_id  grade
person_id
10               0      2
10               1      5
10               2      2
10               3      6
10               4      0
20               0      4
20               1     10
20               2      2
20               3      3
20               4      1
If I wanted to find the best-graded food for each person, I could groupby on person_id and take the row with the largest grade.
print(df.groupby(level=0).apply(lambda g: g.nlargest(1, 'grade')))
                     food_id  grade
person_id person_id
10        10               3      6
20        20               1     10
but I end up with a duplicated person_id level, since apply returns each row with its original index and groupby also prepends the group key as an outer level. If I set as_index=False on the groupby call, the outer level is replaced by a plain integer index, but I still get two levels.
print(df.groupby(level=0, as_index=False).apply(lambda g: g.nlargest(1, 'grade')))
             food_id  grade
  person_id
0 10               3      6
1 20               1     10
Resetting the index inside the apply yields a similar result.
print(df.groupby(level=0).apply(lambda g: g.nlargest(1, 'grade').reset_index(drop=True)))
             food_id  grade
person_id
10        0        3      6
20        0        1     10
The solution here is to set group_keys=False on groupby. With that option, pandas does not prepend the group keys to the result, so the index returned by apply is used as-is.
print(df.groupby(level=0, group_keys=False).apply(lambda g: g.nlargest(1, 'grade')))
           food_id  grade
person_id
10               3      6
20               1     10
If food_id is part of the index, we can skip the apply entirely and use the nlargest method of SeriesGroupBy, like so.
print(df.set_index('food_id', append=True).groupby(level='person_id', group_keys=False)['grade'].nlargest(1))
person_id  food_id
10         3           6
20         1          10
Name: grade, dtype: int64
What I'm getting at is that duplicated index levels are (mostly) the result of misusing pandas' operations, and the better solution is to adjust your operations so the duplication never appears, rather than letting it happen and then finding a fix.
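For completeness, the same per-person maximum can also be computed without apply at all, by sorting once and keeping the first row of each group; this never introduces an extra person_id level. A minimal sketch of that alternative:
# sort by grade (descending), then keep the first (highest-graded)
# row of each person_id group; the original index is left untouched
best = df.sort_values('grade', ascending=False).groupby(level='person_id').head(1)
print(best.sort_index())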
Upvotes: 0
Reputation: 88236
For a solution that generalises to dropping index levels with duplicated names, you can do:
import numpy as np
import pandas as pd

# positions of index levels whose name repeats an earlier level's name
ixs = np.flatnonzero(pd.Index(df.index.names).duplicated()).tolist()
df.reset_index(ixs, drop=True)
                       0
meas_id hec_id lab_id
0       1448   22      1
               11      2
1       1448   14      3
               15      4
               19      5
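Since the frame is one of a list where only some frames have duplicated levels, you can wrap this in a small helper and map it over the whole list. A minimal sketch, where df_list stands in for your list of data frames:
def drop_duplicated_levels(frame):
    # positions of index levels whose name repeats an earlier level's name
    ixs = np.flatnonzero(pd.Index(frame.index.names).duplicated()).tolist()
    # frames without duplicated levels are returned unchanged
    return frame.reset_index(ixs, drop=True) if ixs else frame

df_list = [drop_duplicated_levels(frame) for frame in df_list]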
Upvotes: 2