Reputation:
I am trying to filter a pandas data frame which has 2 levels of columns header using a list in the fairly standard way of using isin
and a mask to retun a new filtered dataframe. However I keep running into a value error around overlapping index names which I can't seem to solve.
Here are examples, of my list I'd like to filter on, as well as the dataframe itself. Not sure how to provide the code needed to generate this dataframe as it is imported from an csv file.
my_list = ['DEF', 'GHI']
+------+--------+----------+---+----------+---+
| CODE | NAME | STANDARD | | ADVANCED | |
+------+--------+----------+---+----------+---+
| CODE | NAME | A | B | A | B |
+------+--------+----------+---+----------+---+
| ABC | APPLE | 1 | 1 | 2 | 2 |
| DEF | BANANA | 3 | 3 | 4 | 4 |
| GHI | KIWI | 5 | 5 | 6 | 6 |
| JKL | MANGO | 7 | 7 | 8 | 8 |
+------+--------+----------+---+----------+---+
running the following line to filter
new_df = df[df['CODE'].isin(my_list)]
returns
ValueError: cannot join with no overlapping index names
however if I run df['CODE'].isin(my_list)
by itself, this seems to work and returns the Symbol
column with True/False values, which means something goes wrong when applying the mask to the original dataframe.
df source code
[{('CODE', 'CODE'): 'ABC',
('NAME', 'NAME'): 'APPLE',
('STANDARD', 'A'): 1,
('STANDARD', 'B'): 1,
('ADVANCED', 'A'): 2,
('ADVANCED', 'B'): 2},
{('CODE', 'CODE'): 'DEF',
('NAME', 'NAME'): 'BANANA',
('STANDARD', 'A'): 3,
('STANDARD', 'B'): 3,
('ADVANCED', 'A'): 4,
('ADVANCED', 'B'): 4},
{('CODE', 'CODE'): 'GHI',
('NAME', 'NAME'): 'KIWI',
('STANDARD', 'A'): 5,
('STANDARD', 'B'): 5,
('ADVANCED', 'A'): 6,
('ADVANCED', 'B'): 6},
{('CODE', 'CODE'): 'JKL',
('NAME', 'NAME'): 'MANGO',
('STANDARD', 'A'): 7,
('STANDARD', 'B'): 7,
('ADVANCED', 'A'): 8,
('ADVANCED', 'B'): 8}]
Upvotes: 4
Views: 844
Reputation: 323306
When there are multiple index
I usually slice with position
my_list = ['DEF', 'GHI']
df.iloc[:,0].isin(my_list)
Out[49]:
0 False
1 True
2 True
3 False
Name: (CODE, CODE), dtype: bool
Upvotes: 3
Reputation: 35646
Take a look at what df['CODE'].isin(my_list)
produces:
CODE
0 False
1 True
2 True
3 False
Compared to a single level index:
my_list = ['DEF', 'GHI']
df = pd.DataFrame({
'CODE': ['ABC', 'DEF', 'GHI', 'JKL']
})
print(df['CODE'].isin(my_list))
0 False
1 True
2 True
3 False
Name: CODE, dtype: bool
Fix
Either grab the index out of CODE:
new_df = df[df['CODE'].isin(my_list)['CODE']]
Or reference the MultiIndex directly:
new_df = df[df[('CODE', 'CODE')].isin(my_list)]
Both produce:
new_df
:
CODE NAME STANDARD ADVANCED
CODE NAME A B A B
1 DEF BANANA 3 3 4 4
2 GHI KIWI 5 5 6 6
Complete Working Example:
import pandas as pd
my_list = ['DEF', 'GHI']
df = pd.DataFrame({
('CODE', 'CODE'): ['ABC', 'DEF', 'GHI', 'JKL'],
('NAME', 'NAME'): ['APPLE', 'BANANA', 'KIWI', 'MANGO'],
('STANDARD', 'A'): [1, 3, 5, 7], ('STANDARD', 'B'): [1, 3, 5, 7],
('ADVANCED', 'A'): [2, 4, 6, 8], ('ADVANCED', 'B'): [2, 4, 6, 8]
})
new_df = df[df[('CODE', 'CODE')].isin(my_list)]
print(new_df)
Upvotes: 4