Error: None of [Index(['...'], dtype='object')] are in the [index]

I am trying to delete a grouped set of rows in pandas according to the following condition:

If a group (grouped by col1) has 2 or more 'c' values in col2, then remove the whole group.

What I have looks like this:

  col1  col2                       
0  A     10:10 
1  A     20:05
2  A     c
3  A     00:10
4  B     04:15
2  B     c
3  B     c
4  B     13:40

And I am trying to get here:

  col1  col2                       
0  A     10:10 
1  A     20:05
2  A     c
3  A     00:10

Typically I do this for other very similar dataframes (and it works):

df = df.groupby('col1').filter(lambda x: x["col2"].value_counts()[['c']].sum() < 2)

But for this one it is not working, and I receive this error:

KeyError: "None of [Index(['c'], dtype='object')] are in the [index]"

Does someone have an idea of how I could do this?

Thanks!

Upvotes: 5

Views: 18377

Answers (3)

jezrael

Reputation: 862681

For better performance, I suggest boolean indexing:

df = df[df['col2'].eq('c').groupby(df['col1']).transform('sum').lt(2)]
print (df)
  col1   col2
0    A  10:10
1    A  20:05
2    A      c
3    A  00:10

Details:

First, compare values with Series.eq (the == operator):

print (df['col2'].eq('c'))
0    False
1    False
2     True
3    False
4    False
2     True
3     True
4    False
Name: col2, dtype: bool

Then count the True values per group with GroupBy.transform and sum; True values are counted as 1:

print (df['col2'].eq('c').groupby(df['col1']).transform('sum'))
0    1.0
1    1.0
2    1.0
3    1.0
4    2.0
2    2.0
3    2.0
4    2.0
Name: col2, dtype: float64

And last, filter with Series.lt (less than):

print (df['col2'].eq('c').groupby(df['col1']).transform('sum').lt(2))
0     True
1     True
2     True
3     True
4    False
2    False
3    False
4    False
Name: col2, dtype: bool
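Putting the three steps together, here is a self-contained sketch, reconstructing the sample data from the question:

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'col2': ['10:10', '20:05', 'c', '00:10', '04:15', 'c', 'c', '13:40'],
})

# Keep rows whose group has fewer than 2 'c' values in col2
mask = df['col2'].eq('c').groupby(df['col1']).transform('sum').lt(2)
out = df[mask]
print(out)
```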

Upvotes: 3

David Erickson

Reputation: 16683

Here is another method using loc. First build a temporary dataframe 'dft' that counts the 'c' values in 'col2' per 'col1' group, and collect the groups with counts below 2 into a list. Then filter the original dataframe with .isin() to keep only those groups:

dft = df.loc[df['col2'] == 'c'].groupby('col1').count().reset_index()
dft = dft.loc[dft['col2'] < 2, 'col1'].to_list()
df = df.loc[df['col1'].isin(dft)]
df
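The same steps as a runnable sketch on the question's sample data. One caveat worth noting: a group containing no 'c' rows at all never appears in dft (the `df['col2'] == 'c'` filter drops it entirely), so `.isin()` would remove it too; that doesn't matter here, where both groups contain at least one 'c'.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'col2': ['10:10', '20:05', 'c', '00:10', '04:15', 'c', 'c', '13:40'],
})

# Count the 'c' rows per col1 group
dft = df.loc[df['col2'] == 'c'].groupby('col1').count().reset_index()
# Keep only the groups with fewer than 2 'c' values
keep = dft.loc[dft['col2'] < 2, 'col1'].to_list()
out = df.loc[df['col1'].isin(keep)]
print(out)
```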

Upvotes: 2

Allen Qin

Reputation: 19947

You can use len:

df.groupby('col1').filter(lambda x: len(x.loc[x.col2.eq('c')])<2)

    col1    col2
0   A   10:10
1   A   20:05
2   A   c
3   A   00:10

Your solution actually also works for me:

df.groupby('col1').filter(lambda x: x["col2"].value_counts()[['c']].sum() < 2)

If it still doesn't work, you can try the following to see if it helps.

df.astype(str).groupby('col1').filter(lambda x: x["col2"].value_counts()[['c']].sum() < 2)
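One likely cause of the original KeyError (an assumption, since the failing dataframe isn't shown) is a group that contains no 'c' at all: its value_counts() then has no 'c' entry, and the `[['c']]` lookup raises. Series.get with a default of 0 sidesteps that:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B'],
    'col2': ['c', 'c', '10:00'],  # group B has no 'c' at all
})

# value_counts()[['c']] would raise KeyError for group B;
# .get('c', 0) returns 0 when 'c' is absent from the counts
out = df.groupby('col1').filter(
    lambda x: x['col2'].value_counts().get('c', 0) < 2
)
print(out)
```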

Upvotes: 1
