The L

Reputation: 126

Condition on all rows of a groupby

Concerning this type of dataframe:

import pandas as pd
import datetime
df = pd.DataFrame({'ID': [1,1,1,1,2,2,2,3],
                   'Time': [datetime.date(2019, 12, 1), datetime.date(2019, 12, 5),datetime.date(2019, 12, 8),datetime.date(2019, 8, 4),datetime.date(2019, 11, 4),datetime.date(2019, 11, 4),datetime.date(2019, 11, 3),datetime.date(2019, 12, 20)],
                   'Value':[2,2,2,50,7,100,7,5]})
    ID  Time        Value
0   1   2019-12-01  2
1   1   2019-12-05  2
2   1   2019-12-08  2
3   1   2019-08-04  50
4   2   2019-11-04  7
5   2   2019-11-04  100
6   2   2019-11-03  7
7   3   2019-12-20  5

I am interested only in the 3 latest values of each ID (by time), and I would like to keep only the IDs where all 3 of these values are < 10.

So my desired output will look like this:

    ID
0   1

Indeed, the value 50 for the first ID is its fourth-latest value, so it is not taken into account.

Upvotes: 4

Views: 574

Answers (2)

rpanai

Reputation: 13447

Within a groupby, for each ID:

  • Sort the group by time
  • Build a boolean flag telling whether the condition < 10 is satisfied
  • Take the flags of the last 3 rows only and sum them
  • Check whether this sum is exactly 3 (IDs with fewer than 3 rows therefore never qualify)
# For each ID: sort by time, flag values below 10, keep the 3 latest flags, require 3 True
grp = (
    df.groupby("ID")
      .apply(lambda x: x.sort_values("Time")["Value"].lt(10).tail(3).sum() == 3)
)

grp[grp]
ID
1    True
dtype: bool
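
A possible variant of the steps above that avoids apply, given only as a sketch: it sorts once, keeps the 3 latest rows per ID with groupby(...).tail(3), and then counts rows and flags per ID. The names latest, stats and result are mine, not part of the question.

```python
import datetime
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 3],
                   'Time': [datetime.date(2019, 12, 1), datetime.date(2019, 12, 5),
                            datetime.date(2019, 12, 8), datetime.date(2019, 8, 4),
                            datetime.date(2019, 11, 4), datetime.date(2019, 11, 4),
                            datetime.date(2019, 11, 3), datetime.date(2019, 12, 20)],
                   'Value': [2, 2, 2, 50, 7, 100, 7, 5]})

# Keep the 3 most recent rows of each ID (fewer if the ID has fewer rows).
latest = df.sort_values("Time").groupby("ID").tail(3)

# Per ID: how many latest rows there are, and how many of them are below 10.
flags = latest["Value"].lt(10).astype(int)
stats = flags.groupby(latest["ID"]).agg(["count", "sum"])

# An ID qualifies only if it has 3 latest rows and all 3 are below 10.
mask = (stats["count"] == 3) & (stats["sum"] == 3)
result = stats[mask].index.tolist()
print(result)
```

On the sample data this should leave only ID 1, the same as grp[grp] above.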

Upvotes: 1

mozway

Reputation: 262149

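You could use a combination of query and groupby+size. Note that this counts every value below 10 per ID, not only the 3 latest ones; it matches the desired output here because the only value of ID 1 that is not below 10 is also its oldest: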

ids = df.query('Value < 10').groupby('ID')['Time'].size().ge(3)
ids[ids].reset_index().drop('Time', axis=1)

output:

   ID
0   1
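
To illustrate how counting all values below 10 can differ from checking only the 3 latest values, here is a small sketch on hypothetical data (the frame demo and ID 9 are made up for this example). Restricting to the 3 latest rows with sort_values and groupby(...).tail(3) before the query is one possible adjustment, not the only one.

```python
import datetime
import pandas as pd

# Hypothetical data: ID 9 has three old values below 10, but its newest value is 50.
demo = pd.DataFrame({
    "ID": [9, 9, 9, 9],
    "Time": [datetime.date(2019, 1, 1), datetime.date(2019, 1, 2),
             datetime.date(2019, 1, 3), datetime.date(2019, 1, 4)],
    "Value": [2, 2, 2, 50],
})

# Counting every value below 10 accepts ID 9 ...
counted = demo.query("Value < 10").groupby("ID").size().ge(3)

# ... whereas restricting to the 3 latest rows first rejects it.
latest = demo.sort_values("Time").groupby("ID").tail(3)
checked = latest.query("Value < 10").groupby("ID").size().ge(3)

accepted_by_count = counted[counted].index.tolist()
accepted_by_latest = checked[checked].index.tolist()
print(accepted_by_count, accepted_by_latest)
```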

Alternative using filter (slower):

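# Time holds datetime.date objects (object dtype), which nlargest rejects, so convert it first
df.groupby('ID').filter(lambda g: len(pd.to_datetime(g.loc[g['Value'].lt(10), 'Time']).nlargest(3)) > 2)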

output:

   ID       Time  Value
0   1 2019-12-01      2
1   1 2019-12-05      2
2   1 2019-12-08      2
3   1 2019-08-04     50

To get only the IDs, append ['ID'].unique() to the filter expression.
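
As a sketch, the filter variant with ['ID'].unique() appended could look like this. It assumes the Time column is first converted with pd.to_datetime, since nlargest does not accept an object-dtype column of datetime.date values; the names kept and ids are only for illustration.

```python
import datetime
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 3],
                   'Time': [datetime.date(2019, 12, 1), datetime.date(2019, 12, 5),
                            datetime.date(2019, 12, 8), datetime.date(2019, 8, 4),
                            datetime.date(2019, 11, 4), datetime.date(2019, 11, 4),
                            datetime.date(2019, 11, 3), datetime.date(2019, 12, 20)],
                   'Value': [2, 2, 2, 50, 7, 100, 7, 5]})

# Assumption: convert the datetime.date objects to datetime64 so nlargest accepts the column.
df["Time"] = pd.to_datetime(df["Time"])

# Keep the groups that have at least 3 values below 10.
kept = df.groupby("ID").filter(
    lambda g: len(g[g["Value"].lt(10)]["Time"].nlargest(3)) > 2
)

# Reduce the surviving rows to the distinct IDs.
ids = kept["ID"].unique().tolist()
print(ids)
```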

Upvotes: 3
