Ale

Reputation: 1004

pandas operation by group

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({'id': [205, 205, 205, 211, 211, 211],
                   'date': pd.to_datetime(['2019-12-01', '2020-01-01', '2020-02-01',
                                           '2019-12-01', '2020-01-01', '2020-03-01'])})
df

    id  date
0   205 2019-12-01
1   205 2020-01-01
2   205 2020-02-01
3   211 2019-12-01
4   211 2020-01-01
5   211 2020-03-01

where the date column consists of consecutive months for id 205, but not for id 211.

I want to keep only the observations (ids) for which I have monthly data without jumps. In this example I want:

    id  date
0   205 2019-12-01
1   205 2020-01-01
2   205 2020-02-01

Here I am collecting the ids to keep:

keep_id = []
for num in pd.unique(df['id']):
    dates = df.loc[df['id'] == num, 'date']
    # difference in months from the previous date
    temp = (dates.dt.year - dates.shift(1).dt.year) * 12 + dates.dt.month - dates.shift(1).dt.month
    temp.values[0] = 1.0  # the first entry has no predecessor, so force it to 1
    if (temp == 1.).all():
        keep_id.append(num)

where (dates.dt.year - dates.shift(1).dt.year) * 12 + dates.dt.month - dates.shift(1).dt.month computes the difference in months from the previous date for every id.
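For reference, the month difference itself can be written without the per-id loop by shifting within groups. This is only a sketch of the same formula, assuming df as defined above:

# previous date within each id, via a single grouped shift
prev = df.groupby('id')['date'].shift(1)
# difference in months from the previous date, for all ids at once
month_diff = (df['date'].dt.year - prev.dt.year) * 12 + (df['date'].dt.month - prev.dt.month)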

The loop seems to work when tested on a small portion of df, but I'm sure there is a better way of doing this, maybe using the .groupby() method.

Since df is made of millions of observations, my code takes too long (and I'd like to learn a more efficient and pythonic way of doing this).

Upvotes: 1

Views: 50

Answers (2)

ShadyFortress

Reputation: 46

What you want to do is use a groupby filter rather than a groupby apply:

df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())

Because the dates are month starts, consecutive months are at most 31 days apart, while a skipped month means a gap of at least 59 days, so the 32-day threshold cleanly separates the two cases. The filter provides exactly:

    id       date
0  205 2019-12-01
1  205 2020-01-01
2  205 2020-02-01

And indeed, I would keep the original index intact rather than moving id into it; it retains too many useful characteristics to throw away.

Both this response and Michael's are correct in terms of output. In terms of performance, they are very similar as well:

%timeit df.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())
1.48 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

and

%timeit df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]
1.7 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For most operations, this difference is negligible.
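To compare the two at something closer to the question's scale, they can be timed on a larger synthetic frame. A minimal sketch, where the sizes and the random row-dropping are illustrative assumptions, not from either answer:

import numpy as np

# illustrative benchmark frame: 10,000 ids, each with the same 12 monthly
# dates, then 10% of rows dropped at random so some ids end up with gaps
n_ids = 10_000
months = pd.date_range('2019-01-01', periods=12, freq='MS')
big = pd.DataFrame({'id': np.repeat(np.arange(n_ids), len(months)),
                    'date': np.tile(months, n_ids)})
big = big.sample(frac=0.9, random_state=0).sort_values(['id', 'date'])

# then time both answers on it, e.g.
# %timeit big.groupby('id').filter(lambda x: not (x.date.diff() > pd.Timedelta(days=32)).any())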

Upvotes: 2

Michael Szczesny

Reputation: 5036

You can use the following approach. It was only ~3x faster than the loop in the question in my tests.

df[df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))]

Out:

    id       date
0  205 2019-12-01
1  205 2020-01-01
2  205 2020-02-01
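For clarity, the transform returns a boolean Series aligned with the original rows, which is what makes the boolean indexing work. A minimal sketch of that mask, assuming the df from the question:

# True for every row of an id whose largest month-to-month gap is under 32 days;
# the per-group scalar is broadcast back to all rows of the group
mask = df.groupby('id')['date'].transform(lambda x: x.diff().max() < pd.Timedelta(days=32))
print(mask.tolist())  # [True, True, True, False, False, False]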

Upvotes: 0
