Reputation: 1023
Given the DataFrame generated by:
import numpy as np
import pandas as pd
from datetime import timedelta
np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=14, freq='9H')
ids = [1]*5 + [2]*2 + [3]*7
df = pd.DataFrame({'id': ids, 'time_entered': rng, 'val': np.random.randn(len(rng))})
df
id time_entered val
0 1 2015-02-24 00:00:00 1.764052
1 1 2015-02-24 09:00:00 0.400157
2 1 2015-02-24 18:00:00 0.978738
3 1 2015-02-25 03:00:00 2.240893
4 1 2015-02-25 12:00:00 1.867558
5 2 2015-02-25 21:00:00 -0.977278
6 2 2015-02-26 06:00:00 0.950088
7 3 2015-02-26 15:00:00 -0.151357
8 3 2015-02-27 00:00:00 -0.103219
9 3 2015-02-27 09:00:00 0.410599
10 3 2015-02-27 18:00:00 0.144044
11 3 2015-02-28 03:00:00 1.454274
12 3 2015-02-28 12:00:00 0.761038
13 3 2015-02-28 21:00:00 0.121675
I need to remove, for each id, the rows that are more than 24 hours (1 day) from the latest time_entered for that id. My current solution:
def custom_transform(x):
    # keep only rows within 24 hours of this id's latest time_entered
    datetime_from = x["time_entered"].max() - timedelta(days=1)
    x = x[x["time_entered"] > datetime_from]
    return x

df.groupby("id").apply(lambda x: custom_transform(x)).reset_index(drop=True)
which gives the correct, expected output:
id time_entered val
0 1 2015-02-24 18:00:00 0.978738
1 1 2015-02-25 03:00:00 2.240893
2 1 2015-02-25 12:00:00 1.867558
3 2 2015-02-25 21:00:00 -0.977278
4 2 2015-02-26 06:00:00 0.950088
5 3 2015-02-28 03:00:00 1.454274
6 3 2015-02-28 12:00:00 0.761038
7 3 2015-02-28 21:00:00 0.121675
However, my real data has tens of millions of rows and hundreds of thousands of unique ids; because of this, the solution above is infeasible (it takes a very long time).
Is there a more efficient way to filter the data? I appreciate all ideas!
Upvotes: 7
Views: 319
Reputation: 323316
df.groupby('id').apply(lambda x : x[(x['time_entered'].max()-x['time_entered'])<pd.Timedelta('1D')]).reset_index(drop=True)
Out[322]:
id time_entered val
0 1 2015-02-24 18:00:00 0.978738
1 1 2015-02-25 03:00:00 2.240893
2 1 2015-02-25 12:00:00 1.867558
3 2 2015-02-25 21:00:00 -0.977278
4 2 2015-02-26 06:00:00 0.950088
5 3 2015-02-28 03:00:00 1.454274
6 3 2015-02-28 12:00:00 0.761038
7 3 2015-02-28 21:00:00 0.121675
Upvotes: 0
Reputation: 150785
Generally, avoid groupby().apply(), since it is not vectorized across groups, not to mention the memory-allocation overhead if you are returning new DataFrames, as in your case.
How about finding the time threshold with groupby().transform, then using boolean indexing on the whole data:
# per-id cutoff: the latest time_entered for that id, minus one day
time_max_by_id = df.groupby('id')['time_entered'].transform('max') - pd.Timedelta('1D')
df[df['time_entered'] > time_max_by_id]
Output:
id time_entered val
2 1 2015-02-24 18:00:00 0.978738
3 1 2015-02-25 03:00:00 2.240893
4 1 2015-02-25 12:00:00 1.867558
5 2 2015-02-25 21:00:00 -0.977278
6 2 2015-02-26 06:00:00 0.950088
11 3 2015-02-28 03:00:00 1.454274
12 3 2015-02-28 12:00:00 0.761038
13 3 2015-02-28 21:00:00 0.121675
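To sanity-check this against the apply-based result from the question, here is a minimal sketch reusing df, custom_transform and time_max_by_id from above (the two results differ only in their row index, which reset_index(drop=True) removes; recent pandas versions may also warn about apply touching the id grouping column):
# compare the transform-based filter with the original apply-based filter
fast = df[df['time_entered'] > time_max_by_id].reset_index(drop=True)
slow = df.groupby('id').apply(lambda x: custom_transform(x)).reset_index(drop=True)
print(fast.equals(slow))  # expected: True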
Upvotes: 4