Ale

Reputation: 1004

pandas operation by group

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({'id': [205, 205, 205, 211, 211, 211],
                   'date': pd.to_datetime(['2019-12-01', '2020-01-01', '2020-02-01',
                                           '2019-12-01', '2020-01-01', '2020-03-01'])})
df

    id  date
0   205 2019-12-01
1   205 2020-01-01
2   205 2020-02-01
3   211 2019-12-01
4   211 2020-01-01
5   211 2020-03-01

where the date column consists of consecutive months for id 205, but not for id 211.

I want to keep only the observations (ids) for which I have monthly data without jumps. In this example I want:

    id  date
0   205 2019-12-01
1   205 2020-01-01
2   205 2020-02-01

Here I am collecting the ids to keep:

keep_id = []
for num in pd.unique(df['id']):
    dates = df.loc[df['id'] == num, 'date']
    # difference in months from the previous date
    temp = (dates.dt.year - dates.shift(1).dt.year) * 12 + dates.dt.month - dates.shift(1).dt.month
    temp.values[0] = 1.0  # the first entry has no predecessor, so force it to 1
    if (temp == 1.).all():
        keep_id.append(num)

where (dates.dt.year - dates.shift(1).dt.year) * 12 + dates.dt.month - dates.shift(1).dt.month computes the difference in months from the previous date for every id.
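For reference, the month difference itself can be written without the per-id loop by shifting within groups. This is only a sketch of the same formula, assuming df as defined above:

# previous date within each id, via a single grouped shift
prev = df.groupby('id')['date'].shift(1)
# difference in months from the previous date, for all ids at once
month_diff = (df['date'].dt.year - prev.dt.year) * 12 + (df['date'].dt.month - prev.dt.month)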

The loop seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.

Since df is made of millions of observations, my code takes too long (and I'd like to learn a more efficient and pythonic way of doing this).

Upvotes: 1

Views: 50

Answers (2)

ShadyFortress

Reputation: 46

What you want to do is use a groupby filter rather than a groupby apply:

df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())

Because the dates are month starts, consecutive months are at most 31 days apart, while a skipped month means a gap of at least 59 days, so the 32-day threshold cleanly separates the two cases. The filter provides exactly:

    id       date
0  205 2019-12-01
1  205 2020-01-01
2  205 2020-02-01

And indeed, I would keep the original index intact rather than moving id into it; it retains too many useful characteristics to throw away.

Both this response and Michael's are correct in terms of output. In terms of performance, they are very similar as well:

%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

and

%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For most operations, this difference is negligible.
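To compare the two at something closer to the question's scale, they can be timed on a larger synthetic frame. A minimal sketch, where the sizes and the random row-dropping are illustrative assumptions, not from either answer:

import numpy as np

# illustrative benchmark frame: 10,000 ids, each with the same 12 monthly
# dates, then 10% of rows dropped at random so some ids end up with gaps
n_ids = 10_000
months = pd.date_range('2019-01-01', periods=12, freq='MS')
big = pd.DataFrame({'id': np.repeat(np.arange(n_ids), len(months)),
                    'date': np.tile(months, n_ids)})
big = big.sample(frac=0.9, random_state=0).sort_values(['id', 'date'])

# then time both answers on it, e.g.
# %timeit big.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())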

Upvotes: 2

Michael Szczesny

Reputation: 5036

You can use the following approach. It was only ~3x faster than the loop in the question in my tests.

df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]

Out:

    id       date
0  205 2019-12-01
1  205 2020-01-01
2  205 2020-02-01
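For clarity, the transform returns a boolean Series aligned with the original rows, which is what makes the boolean indexing work. A minimal sketch of that mask, assuming the df from the question:

# True for every row of an id whose largest month-to-month gap is under 32 days;
# the per-group scalar is broadcast back to all rows of the group
mask = df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))
print(mask.tolist())  # [True, True, True, False, False, False]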

Upvotes: 0
