Reputation: 126
Given a dataframe like this:
import pandas as pd
import datetime
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 2, 3],
                   'Time': [datetime.date(2019, 12, 1), datetime.date(2019, 12, 5),
                            datetime.date(2019, 12, 8), datetime.date(2019, 8, 4),
                            datetime.date(2019, 11, 4), datetime.date(2019, 11, 4),
                            datetime.date(2019, 11, 3), datetime.date(2019, 12, 20)],
                   'Value': [2, 2, 2, 50, 7, 100, 7, 5]})
   ID        Time  Value
0   1  2019-12-01      2
1   1  2019-12-05      2
2   1  2019-12-08      2
3   1  2019-08-04     50
4   2  2019-11-04      7
5   2  2019-11-04    100
6   2  2019-11-03      7
7   3  2019-12-20      5
For each ID, I am interested only in the 3 latest values (by Time), and I would like to keep only the IDs where all 3 of these values are < 10.
So my desired output will look like this:
ID
0 1
The value 50 for ID 1 is only its fourth-latest value, so it should be ignored.
Upvotes: 4
Views: 574
Reputation: 13447
Within a groupby:
grp = (df.groupby("ID")
         .apply(lambda x: x.sort_values("Time")["Value"].lt(10)[-3:].sum() == 3))
grp[grp]
ID
1 True
dtype: bool
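For what it's worth, the same idea (sort by time, take the last three rows per ID, require all three to be below 10) can probably be written without `apply`. This is only a sketch on the question's data; the names `latest`, `ok` and `result` are mine, not from the original post.

```python
import datetime
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 3],
    "Time": [datetime.date(2019, 12, 1), datetime.date(2019, 12, 5),
             datetime.date(2019, 12, 8), datetime.date(2019, 8, 4),
             datetime.date(2019, 11, 4), datetime.date(2019, 11, 4),
             datetime.date(2019, 11, 3), datetime.date(2019, 12, 20)],
    "Value": [2, 2, 2, 50, 7, 100, 7, 5],
})

# Sort chronologically, then keep the 3 latest rows of each ID.
latest = df.sort_values("Time").groupby("ID").tail(3)

# Count how many of those rows are below 10; an ID qualifies only if all 3 are.
# IDs with fewer than 3 rows can never reach a count of 3, so they drop out.
ok = latest["Value"].lt(10).groupby(latest["ID"]).sum().eq(3)

result = ok[ok].index.tolist()
print(result)
```

On the sample data only ID 1 should survive: ID 2 has a 100 among its three rows and ID 3 has a single row.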
Upvotes: 1
Reputation: 262149
You could use a combination of query and groupby + size:
ids = df.query('Value < 10').groupby('ID')['Time'].size().ge(3)
ids[ids].reset_index().drop('Time', axis=1)
output:
ID
0 1
Alternative using filter (slower):
df.groupby('ID').filter(lambda g: len(g[g['Value'].lt(10)]['Time'].nlargest(3))>2)
output:
   ID        Time  Value
0   1  2019-12-01      2
1   1  2019-12-05      2
2   1  2019-12-08      2
3   1  2019-08-04     50
To get only the IDs, append ['ID'].unique() to the expression.
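One caveat worth checking: filtering on Value < 10 before looking at the times counts any three small values, not necessarily the three latest ones. The two happen to agree on the question's data, but they can differ. A small sketch with made-up data (ID 9 and its rows are hypothetical, not from the question) that compares both orders of operations:

```python
import pandas as pd

# Hypothetical data: three small values followed by a most recent value of 50.
demo = pd.DataFrame({
    "ID": [9, 9, 9, 9],
    "Time": pd.to_datetime(["2019-01-01", "2019-01-02",
                            "2019-01-03", "2019-01-04"]),
    "Value": [2, 2, 2, 50],
})

# Filter first: counts values below 10 regardless of how recent they are.
filter_first = demo.query("Value < 10").groupby("ID").size().ge(3)

# Latest three first: restrict to the 3 most recent rows, then test them.
latest = demo.sort_values("Time").groupby("ID").tail(3)
latest_first = latest["Value"].lt(10).groupby(latest["ID"]).sum().eq(3)

print(filter_first.loc[9], latest_first.loc[9])
```

Here the filter-first version keeps ID 9 while the latest-three-first version rejects it, because 50 is among the three most recent values. If the requirement really is "the 3 latest values", taking the latest rows before testing them seems the safer order.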
Upvotes: 3