Reputation: 303
I am trying to delete a grouped set of rows in pandas according to the following condition:
If a group (grouped by col1) has more than 2 values 'c' in col2, then remove the whole group.
What I have looks like this
col1 col2
0 A 10:10
1 A 20:05
2 A c
3 A 00:10
4 B 04:15
2 B c
3 B c
4 B 13:40
And I am trying to get here:
col1 col2
0 A 10:10
1 A 20:05
2 A c
3 A 00:10
Typically I do this for other very similar dataframes (and it works):
df = df.groupby('col1').filter(lambda x: x["col2"].value_counts()[['c']].sum() < 2)
But for this one is not working and I receive this error:
KeyError: "None of [Index(['c'], dtype='object')] are in the [index]"
Does someone have an idea on how I could do this?
Thanks!
Upvotes: 5
Views: 18377
Reputation: 862681
I suggest use for improve performance boolean indexing
:
df = df[df['col2'].eq('c').groupby(df['col1']).transform('sum').lt(2)]
print (df)
col1 col2
0 A 10:10
1 A 20:05
2 A c
3 A 00:10
Details:
First compare values by Series.eq
for ==
:
print (df['col2'].eq('c'))
0 False
1 False
2 True
3 False
4 False
2 True
3 True
4 False
Name: col2, dtype: bool
Then count True
value per groups by GroupBy.transform
with sum
, True
s are processing like 1
:
print (df['col2'].eq('c').groupby(df['col1']).transform('sum'))
0 1.0
1 1.0
2 1.0
3 1.0
4 2.0
2 2.0
3 2.0
4 2.0
Name: col2, dtype: float64
And last filter by Series.lt
for less:
print (df['col2'].eq('c').groupby(df['col1']).transform('sum').lt(2))
0 True
1 True
2 True
3 True
4 False
2 False
3 False
4 False
Name: col2, dtype: bool
Upvotes: 3
Reputation: 16683
Here is another method using loc
to create a temporary dataframe that takes the count of 'c' values in 'col2' grouped by 'col1' and sends the counts '<2' to a list. Then do another loc on the original dataframe to filter for the 'col1' group using .isin()
if they were flagged as a column with less than 2 counts of 'c' per group from the temporary dataframe/list 'dft':
dft = df.loc[df['col2'] == 'c'].groupby('col1').count().reset_index()
dft = dft.loc[dft['col2'] < 2, 'col1'].to_list()
df = df.loc[df['col1'].isin(dft)]
df
Upvotes: 2
Reputation: 19947
You can use len:
df.groupby('col1').filter(lambda x: len(x.loc[x.col2.eq('c')])<2)
col1 col2
0 A 10:10
1 A 20:05
2 A c
3 A 00:10
Your solution actually also works for me:
df.groupby('col1').filter(lambda x: x["col2"].value_counts()[['c']].sum() < 2)
If it still doesn't work, you can try the following to see if it helps.
df.astype(str).groupby('col1').filter(lambda x: x["col2"].value_counts()[['c']].sum() < 2)
Upvotes: 1