Reputation: 12152
I want to remove a subset of rows from a Pandas DataFrame
based on a groupby()
inspection.
The primary DataFrame
:
>>> df
name day fruit foobar
0 Tim 1 Apple 0
1 Tim 1 Apple 1
2 Tim 2 Apple 2
3 Anna 1 Banana 3
4 Anna 1 Strawberry 4
5 Bob 1 Strawberry 5
6 Bob 2 Apple 6
7 Bob 2 Kiwi 7
I group this by name
and fruit
and filter by foobar > 5
.
g = df.groupby(['name', 'fruit']).agg(sum)
g = g[(g.foobar > 5)]
Result is
>>> g
name fruit
Bob Apple 2 6
Kiwi 2 7
Now I want to remove the related rows from df
so that the primary DataFrame
looks like this at the end:
>>> df
name day fruit foobar
0 Tim 1 Apple 0
1 Tim 1 Apple 1
2 Tim 2 Apple 2
3 Anna 1 Banana 3
4 Anna 1 Strawberry 4
5 Bob 1 Strawberry 5
It does not matter for me if name
and fruit
is the index or just a usual column.
Is this what is called a left outer join?
This is the full code to create the sample data.
import pandas as pd
data = {'name': ['Tim', 'Tim', 'Tim', 'Anna', 'Anna', 'Bob', 'Bob', 'Bob'],
'day': [1, 1, 2, 1, 1, 1, 2, 2],
'fruit': ['Apple', 'Apple', 'Apple', 'Banana', 'Strawberry',
'Strawberry', 'Apple', 'Kiwi'],
'foobar': range(8),
}
# That is the initial DataFrame
df = pd.DataFrame(data)
g = df.groupby(['name', 'day']).agg(sum)
g = g[(g.foobar > 5)]
Upvotes: 2
Views: 2302
Reputation: 862581
Use GroupBy.transform
for repeat aggregated values per groups, so possible filter in boolean indexing
with inverted mask >
to <=
:
df = df[df.groupby(['name', 'fruit']).foobar.transform(sum) <= 5]
print (df)
name day fruit foobar
0 Tim 1 Apple 0
1 Tim 1 Apple 1
2 Tim 2 Apple 2
3 Anna 1 Banana 3
4 Anna 1 Strawberry 4
5 Bob 1 Strawberry 5
Upvotes: 3
Reputation: 12152
Not sure if this is an elegant solution. Based on this tutorial.
Booth DataFrame
need to have the same index.
df.set_index(['name', 'fruit'])
After that simple drop()
by index..
df = df.drop(g.indes)
Upvotes: 0