Reputation:

Filtering a pandas data frame which has 2 level of column headers

I am trying to filter a pandas data frame which has 2 levels of columns header using a list in the fairly standard way of using isin and a mask to retun a new filtered dataframe. However I keep running into a value error around overlapping index names which I can't seem to solve.

Here are examples, of my list I'd like to filter on, as well as the dataframe itself. Not sure how to provide the code needed to generate this dataframe as it is imported from an csv file.

my_list = ['DEF', 'GHI']

+------+--------+----------+---+----------+---+
| CODE |  NAME  | STANDARD |   | ADVANCED |   |
+------+--------+----------+---+----------+---+
| CODE | NAME   | A        | B | A        | B |
+------+--------+----------+---+----------+---+
| ABC  | APPLE  | 1        | 1 | 2        | 2 |
| DEF  | BANANA | 3        | 3 | 4        | 4 |
| GHI  | KIWI   | 5        | 5 | 6        | 6 |
| JKL  | MANGO  | 7        | 7 | 8        | 8 |
+------+--------+----------+---+----------+---+

running the following line to filter

new_df = df[df['CODE'].isin(my_list)]

returns

ValueError: cannot join with no overlapping index names

however if I run df['CODE'].isin(my_list) by itself, this seems to work and returns the Symbol column with True/False values, which means something goes wrong when applying the mask to the original dataframe.

df source code

[{('CODE', 'CODE'): 'ABC',
  ('NAME', 'NAME'): 'APPLE',
  ('STANDARD', 'A'): 1,
  ('STANDARD', 'B'): 1,
  ('ADVANCED', 'A'): 2,
  ('ADVANCED', 'B'): 2},
 {('CODE', 'CODE'): 'DEF',
  ('NAME', 'NAME'): 'BANANA',
  ('STANDARD', 'A'): 3,
  ('STANDARD', 'B'): 3,
  ('ADVANCED', 'A'): 4,
  ('ADVANCED', 'B'): 4},
 {('CODE', 'CODE'): 'GHI',
  ('NAME', 'NAME'): 'KIWI',
  ('STANDARD', 'A'): 5,
  ('STANDARD', 'B'): 5,
  ('ADVANCED', 'A'): 6,
  ('ADVANCED', 'B'): 6},
 {('CODE', 'CODE'): 'JKL',
  ('NAME', 'NAME'): 'MANGO',
  ('STANDARD', 'A'): 7,
  ('STANDARD', 'B'): 7,
  ('ADVANCED', 'A'): 8,
  ('ADVANCED', 'B'): 8}]

Upvotes: 4

Answers (3)

rhug123

Reputation: 8768

Here is another way:

df.loc[df['CODE'].isin(my_list).squeeze()]

Upvotes: 1

BENY

Reputation: 323306

When there are multiple index I usually slice with position

my_list = ['DEF', 'GHI']
df.iloc[:,0].isin(my_list)
Out[49]: 
0    False
1     True
2     True
3    False
Name: (CODE, CODE), dtype: bool

Upvotes: 3

Henry Ecker

Reputation: 35646

Take a look at what df['CODE'].isin(my_list) produces:

    CODE
0  False
1   True
2   True
3  False

Compared to a single level index:

my_list = ['DEF', 'GHI']

df = pd.DataFrame({
    'CODE': ['ABC', 'DEF', 'GHI', 'JKL']
})
print(df['CODE'].isin(my_list))

0    False
1     True
2     True
3    False
Name: CODE, dtype: bool

Fix

Either grab the index out of CODE:

new_df = df[df['CODE'].isin(my_list)['CODE']]

Or reference the MultiIndex directly:

new_df = df[df[('CODE', 'CODE')].isin(my_list)]

Both produce:

new_df:

  CODE    NAME STANDARD    ADVANCED   
  CODE    NAME        A  B        A  B
1  DEF  BANANA        3  3        4  4
2  GHI    KIWI        5  5        6  6

Complete Working Example:

import pandas as pd

my_list = ['DEF', 'GHI']

df = pd.DataFrame({
    ('CODE', 'CODE'): ['ABC', 'DEF', 'GHI', 'JKL'],
    ('NAME', 'NAME'): ['APPLE', 'BANANA', 'KIWI', 'MANGO'],
    ('STANDARD', 'A'): [1, 3, 5, 7], ('STANDARD', 'B'): [1, 3, 5, 7],
    ('ADVANCED', 'A'): [2, 4, 6, 8], ('ADVANCED', 'B'): [2, 4, 6, 8]
})

new_df = df[df[('CODE', 'CODE')].isin(my_list)]
print(new_df)

Upvotes: 4

Filtering a pandas data frame which has 2 level of column headers

Answers (3)

Related Questions