iwtbid
iwtbid

Reputation: 85

pandas filtering rows by group value

Here is the data that I'm practicing with

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

I want to filter individual rows by a grouped valued. I know I can do the following to filter groups

df.groupby("day").filter(lambda x: x['total_bill'].mean() > 20).day.unique()

Which finds which days have an average bill greater than $20. This works because groupby.filter takes a function to apply to each subframe that should return True or False. However, what if I want to find every meal (row) whose value for total_bill is greater than that day's total_bill. Eg if a row has total_bill of 22 and was on Sunday, then it should be kept because Sunday's total_bill average was 21.41.

This is my attempt:

df.groupby('day').apply(lambda x: x['total_bill'] > x['total_bill'].mean())

However, this yields something that looks like this (first few rows)

day    
Fri  90     True
     91     True
     92    False
     93    False
     94     True
Name: total_bill, dtype: bool

This is not in the same order as the dataframe so I can't just take the boolean column and use it to index the data.

So now I do the following:

grouped = (df
           .groupby('day')
           .apply(lambda x: x['total_bill'] > x['total_bill'].mean())
           .reset_index())

index_bill = (grouped
             .loc[grouped.total_bill == True, 'level_1'].values)
df.loc[index_bill]

Which gives me the desired result... there has to be an easier way, right? Please let me know if there is a proper way to do this. If not, is there at least a way I can combine the two steps into one? I can do the groupby, but am not sure how to get the values without storing the grouped object as a variable and then referencing it. Thanks!

Upvotes: 4

Views: 3861

Answers (2)

jezrael
jezrael

Reputation: 862641

For same output need transform for return same Series of mean per group, then filter by boolean indexing and last add sort_values:

a=df[df['total_bill'] > df.groupby('day')['total_bill'].transform('mean')].sort_values('day')
print (a.head(20))
     total_bill   tip     sex smoker  day    time  size
90        28.97  3.00    Male    Yes  Fri  Dinner     2
91        22.49  3.50    Male     No  Fri  Dinner     2
94        22.75  3.25  Female     No  Fri  Dinner     2
95        40.17  4.73    Male    Yes  Fri  Dinner     4
96        27.28  4.00    Male    Yes  Fri  Dinner     2
98        21.01  3.00    Male    Yes  Fri  Dinner     2
102       44.30  2.50  Female    Yes  Sat  Dinner     3
206       26.59  3.41    Male    Yes  Sat  Dinner     3
229       22.12  2.88  Female    Yes  Sat  Dinner     2
227       20.45  3.00    Male     No  Sat  Dinner     4
219       30.14  3.09  Female    Yes  Sat  Dinner     4
237       32.83  1.17    Male    Yes  Sat  Dinner     2
103       22.42  3.48  Female    Yes  Sat  Dinner     2
106       20.49  4.06    Male    Yes  Sat  Dinner     2
107       25.21  4.29    Male    Yes  Sat  Dinner     2
216       28.15  3.00    Male    Yes  Sat  Dinner     5
214       28.17  6.50  Female    Yes  Sat  Dinner     3
241       22.67  2.00    Male    Yes  Sat  Dinner     2
212       48.33  9.00    Male     No  Sat  Dinner     4
211       25.89  5.16    Male    Yes  Sat  Dinner     4

EDIT:

For correct ordering by days is possible use ordered categorical:

cats = ['Mon','Tue','Wed','Thur','Fri','Sat','Sun']
df['day'] = pd.Categorical(df['day'], categories=cats, ordered=True)
means = df.groupby('day')['total_bill'].transform('mean')
df1 = df[df['total_bill'] > means].sort_values('day')
print (df1.head(20))
     total_bill   tip     sex smoker   day   time  size
129       22.82  2.18    Male     No  Thur  Lunch     3
80        19.44  3.00    Male    Yes  Thur  Lunch     2
83        32.68  5.00    Male    Yes  Thur  Lunch     2
85        34.83  5.17  Female     No  Thur  Lunch     4
87        18.28  4.00    Male     No  Thur  Lunch     2
88        24.71  5.85    Male     No  Thur  Lunch     2
89        21.16  3.00    Male     No  Thur  Lunch     2
119       24.08  2.92  Female     No  Thur  Lunch     4
125       29.80  4.20  Female     No  Thur  Lunch     6
130       19.08  1.50    Male     No  Thur  Lunch     2
78        22.76  3.00    Male     No  Thur  Lunch     2
131       20.27  2.83  Female     No  Thur  Lunch     2
141       34.30  6.70    Male     No  Thur  Lunch     6
142       41.19  5.00    Male     No  Thur  Lunch     5
143       27.05  5.00  Female     No  Thur  Lunch     6
146       18.64  1.36  Female     No  Thur  Lunch     3
191       19.81  4.19  Female    Yes  Thur  Lunch     2
192       28.44  2.56    Male    Yes  Thur  Lunch     2
197       43.11  5.00  Female    Yes  Thur  Lunch     4
200       18.71  4.00    Male    Yes  Thur  Lunch     3

Upvotes: 1

Scott Boston
Scott Boston

Reputation: 153460

The best way I think to do is to use boolean indexing with groupby and transfrom. First you group by day to find mean for that day, then use transform to apply that mean to each row, next compare that mean to the actual total_billed on that day, then use that boolean series to filter your dataframe with boolean indexing.

df[df.groupby('day')['total_bill'].transform('mean') < df['total_bill']]

Output:

     total_bill   tip     sex smoker   day    time  size
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
5         25.29  4.71    Male     No   Sun  Dinner     4
7         26.88  3.12    Male     No   Sun  Dinner     4
11        35.26  5.00  Female     No   Sun  Dinner     4
15        21.58  3.92    Male     No   Sun  Dinner     2
19        20.65  3.35    Male     No   Sat  Dinner     3
23        39.42  7.58    Male     No   Sat  Dinner     4
28        21.70  4.30    Male     No   Sat  Dinner     2
33        20.69  2.45  Female     No   Sat  Dinner     4
35        24.06  3.60    Male     No   Sat  Dinner     3
39        31.27  5.00    Male     No   Sat  Dinner     3
44        30.40  5.60    Male     No   Sun  Dinner     4
46        22.23  5.00    Male     No   Sun  Dinner     2
47        32.40  6.00    Male     No   Sun  Dinner     4
48        28.55  2.05    Male     No   Sun  Dinner     3
52        34.81  5.20  Female     No   Sun  Dinner     4
54        25.56  4.34    Male     No   Sun  Dinner     4
56        38.01  3.00    Male    Yes   Sat  Dinner     4
57        26.41  1.50  Female     No   Sat  Dinner     2
59        48.27  6.73    Male     No   Sat  Dinner     4
72        26.86  3.14  Female    Yes   Sat  Dinner     2
73        25.28  5.00  Female    Yes   Sat  Dinner     2
77        27.20  4.00    Male     No  Thur   Lunch     4
78        22.76  3.00    Male     No  Thur   Lunch     2
80        19.44  3.00    Male    Yes  Thur   Lunch     2
83        32.68  5.00    Male    Yes  Thur   Lunch     2
85        34.83  5.17  Female     No  Thur   Lunch     4
87        18.28  4.00    Male     No  Thur   Lunch     2
88        24.71  5.85    Male     No  Thur   Lunch     2
..          ...   ...     ...    ...   ...     ...   ...
180       34.65  3.68    Male    Yes   Sun  Dinner     4
181       23.33  5.65    Male    Yes   Sun  Dinner     2
182       45.35  3.50    Male    Yes   Sun  Dinner     3
183       23.17  6.50    Male    Yes   Sun  Dinner     4
184       40.55  3.00    Male    Yes   Sun  Dinner     2
187       30.46  2.00    Male    Yes   Sun  Dinner     5
189       23.10  4.00    Male    Yes   Sun  Dinner     3
191       19.81  4.19  Female    Yes  Thur   Lunch     2
192       28.44  2.56    Male    Yes  Thur   Lunch     2
197       43.11  5.00  Female    Yes  Thur   Lunch     4
200       18.71  4.00    Male    Yes  Thur   Lunch     3
204       20.53  4.00    Male    Yes  Thur   Lunch     4
206       26.59  3.41    Male    Yes   Sat  Dinner     3
207       38.73  3.00    Male    Yes   Sat  Dinner     4
208       24.27  2.03    Male    Yes   Sat  Dinner     2
210       30.06  2.00    Male    Yes   Sat  Dinner     3
211       25.89  5.16    Male    Yes   Sat  Dinner     4
212       48.33  9.00    Male     No   Sat  Dinner     4
214       28.17  6.50  Female    Yes   Sat  Dinner     3
216       28.15  3.00    Male    Yes   Sat  Dinner     5
219       30.14  3.09  Female    Yes   Sat  Dinner     4
227       20.45  3.00    Male     No   Sat  Dinner     4
229       22.12  2.88  Female    Yes   Sat  Dinner     2
230       24.01  2.00    Male    Yes   Sat  Dinner     4
237       32.83  1.17    Male    Yes   Sat  Dinner     2
238       35.83  4.67  Female     No   Sat  Dinner     3
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[97 rows x 7 columns]

Upvotes: 3

Related Questions