buhtz
buhtz

Reputation: 12152

Remove rows based on a Pandas groupby() result

I want to remove a subset of rows from a Pandas DataFrame based on a groupby() inspection.

The primary DataFrame:

>>> df
   name  day       fruit  foobar
0   Tim    1       Apple       0
1   Tim    1       Apple       1
2   Tim    2       Apple       2
3  Anna    1      Banana       3
4  Anna    1  Strawberry       4
5   Bob    1  Strawberry       5
6   Bob    2       Apple       6
7   Bob    2        Kiwi       7

I group this by name and fruit and filter by foobar > 5.

g = df.groupby(['name', 'fruit']).agg(sum)
g = g[(g.foobar > 5)]

Result is

>>> g
name fruit
Bob  Apple    2       6
     Kiwi     2       7

Now I want to remove the related rows from df so that the primary DataFrame looks like this at the end:

>>> df
   name  day       fruit  foobar
0   Tim    1       Apple       0
1   Tim    1       Apple       1
2   Tim    2       Apple       2
3  Anna    1      Banana       3
4  Anna    1  Strawberry       4
5   Bob    1  Strawberry       5

It does not matter for me if name and fruit is the index or just a usual column.

Is this what is called a left outer join?

This is the full code to create the sample data.

import pandas as pd

data = {'name': ['Tim', 'Tim', 'Tim', 'Anna', 'Anna', 'Bob', 'Bob', 'Bob'],
        'day': [1, 1, 2, 1, 1, 1, 2, 2],
        'fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Strawberry',
                  'Strawberry', 'Apple', 'Kiwi'],
        'foobar': range(8),
}

# That is the initial DataFrame
df = pd.DataFrame(data)

g = df.groupby(['name', 'day']).agg(sum)
g = g[(g.foobar > 5)]

Upvotes: 2

Views: 2302

Answers (2)

jezrael
jezrael

Reputation: 862581

Use GroupBy.transform for repeat aggregated values per groups, so possible filter in boolean indexing with inverted mask > to <=:

df = df[df.groupby(['name', 'fruit']).foobar.transform(sum) <= 5]
print (df)
   name  day       fruit  foobar
0   Tim    1       Apple       0
1   Tim    1       Apple       1
2   Tim    2       Apple       2
3  Anna    1      Banana       3
4  Anna    1  Strawberry       4
5   Bob    1  Strawberry       5

Upvotes: 3

buhtz
buhtz

Reputation: 12152

Not sure if this is an elegant solution. Based on this tutorial.

Booth DataFrame need to have the same index.

df.set_index(['name', 'fruit'])

After that simple drop() by index..

df = df.drop(g.indes)

Upvotes: 0

Related Questions