Marcus

Reputation: 1023

Efficient way of filtering by datetime in groupby

Given the DataFrame generated by:

import numpy as np
import pandas as pd
from datetime import timedelta

np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=14, freq='9H')
ids = [1]*5 + [2]*2 + [3]*7
df = pd.DataFrame({'id': ids, 'time_entered': rng, 'val': np.random.randn(len(rng))})

df:

    id  time_entered        val
0   1   2015-02-24 00:00:00 1.764052
1   1   2015-02-24 09:00:00 0.400157
2   1   2015-02-24 18:00:00 0.978738
3   1   2015-02-25 03:00:00 2.240893
4   1   2015-02-25 12:00:00 1.867558
5   2   2015-02-25 21:00:00 -0.977278
6   2   2015-02-26 06:00:00 0.950088
7   3   2015-02-26 15:00:00 -0.151357
8   3   2015-02-27 00:00:00 -0.103219
9   3   2015-02-27 09:00:00 0.410599
10  3   2015-02-27 18:00:00 0.144044
11  3   2015-02-28 03:00:00 1.454274
12  3   2015-02-28 12:00:00 0.761038
13  3   2015-02-28 21:00:00 0.121675

For each id, I need to remove the rows whose time_entered is more than 24 hours (1 day) before the latest time_entered of that id. My current solution:

def custom_transform(x):
    # keep only rows within 24 hours of this group's latest time_entered
    datetime_from = x["time_entered"].max() - timedelta(days=1)
    return x[x["time_entered"] > datetime_from]

df.groupby("id").apply(custom_transform).reset_index(drop=True)

which gives the correct, expected output:

    id  time_entered        val
0   1   2015-02-24 18:00:00 0.978738
1   1   2015-02-25 03:00:00 2.240893
2   1   2015-02-25 12:00:00 1.867558
3   2   2015-02-25 21:00:00 -0.977278
4   2   2015-02-26 06:00:00 0.950088
5   3   2015-02-28 03:00:00 1.454274
6   3   2015-02-28 12:00:00 0.761038
7   3   2015-02-28 21:00:00 0.121675

However, my real data has tens of millions of rows and hundreds of thousands of unique ids, so this solution is infeasible (it takes a very long time).

Is there a more efficient way to filter the data? I appreciate all ideas!
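For anyone who wants to benchmark, a frame at roughly that scale can be built like this (a sketch; the sizes here are placeholders, not my real data):

# Hypothetical benchmark frame; sizes are placeholders for the real data.
n_ids = 100_000                  # hundreds of thousands of unique ids
rows_per_id = 100                # ~10 million rows in total
big_df = pd.DataFrame({
    'id': np.repeat(np.arange(n_ids), rows_per_id),
    'time_entered': np.tile(
        pd.date_range('2015-01-01', periods=rows_per_id, freq='9H').values,
        n_ids),
    'val': np.random.randn(n_ids * rows_per_id),
})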

Upvotes: 7

Views: 319

Answers (2)

BENY

Reputation: 323316

df.groupby('id').apply(
    lambda x: x[(x['time_entered'].max() - x['time_entered']) < pd.Timedelta('1D')]
).reset_index(drop=True)
Out[322]: 
   id        time_entered       val
0   1 2015-02-24 18:00:00  0.978738
1   1 2015-02-25 03:00:00  2.240893
2   1 2015-02-25 12:00:00  1.867558
3   2 2015-02-25 21:00:00 -0.977278
4   2 2015-02-26 06:00:00  0.950088
5   3 2015-02-28 03:00:00  1.454274
6   3 2015-02-28 12:00:00  0.761038
7   3 2015-02-28 21:00:00  0.121675

Upvotes: 0

Quang Hoang

Reputation: 150785

Generally, avoid groupby().apply(), since it is not vectorized across groups, not to mention the memory-allocation overhead when you return a new dataframe for every group, as in your case.

How about finding the time threshold with groupby().transform, then using boolean indexing on the whole frame:

# per-id cutoff: the latest time_entered minus one day, aligned with df's rows
time_max_by_id = df.groupby('id')['time_entered'].transform('max') - pd.Timedelta('1D')
df[df['time_entered'] > time_max_by_id]

Output:

    id        time_entered       val
2    1 2015-02-24 18:00:00  0.978738
3    1 2015-02-25 03:00:00  2.240893
4    1 2015-02-25 12:00:00  1.867558
5    2 2015-02-25 21:00:00 -0.977278
6    2 2015-02-26 06:00:00  0.950088
11   3 2015-02-28 03:00:00  1.454274
12   3 2015-02-28 12:00:00  0.761038
13   3 2015-02-28 21:00:00  0.121675
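To see the gap at scale, you can time both approaches side by side. A rough sketch, assuming a large frame like the big_df built in the question (exact timings depend on hardware and pandas version):

import time

start = time.perf_counter()
cutoff = big_df.groupby('id')['time_entered'].transform('max') - pd.Timedelta('1D')
res_transform = big_df[big_df['time_entered'] > cutoff]
print('transform:', time.perf_counter() - start)

start = time.perf_counter()
res_apply = big_df.groupby('id').apply(custom_transform).reset_index(drop=True)
print('apply:', time.perf_counter() - start)

# Both keep the same rows; big_df is sorted by id, so row order matches too.
assert res_apply.equals(res_transform.reset_index(drop=True))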

Upvotes: 4
