Bera
Bera

Reputation: 1949

Group nearby dates

I want to group nearby dates together, using a rolling window (?) of three week periods.

See example and attempt below:

import pandas as pd

d = {'id':[1, 1, 1, 1, 2, 3],
     'datefield':['2021-01-01', '2021-01-15', '2021-01-30', '2021-02-05', '2020-02-10', '2020-02-20']}
df = pd.DataFrame(data=d)
df['datefield'] = pd.to_datetime(df['datefield'])

#    id  datefield
#0   1 2021-01-01
#1   1 2021-01-15
#2   1 2021-02-01
#3   2 2020-02-10
#4   3 2020-02-20

df['event'] = df.groupby(['id', pd.Grouper(key='datefield', freq='3W')]).ngroup()

#   id  datefield  event
#0   1 2021-01-01      0
#1   1 2021-01-15      0
#2   1 2021-01-30      1 #Should be 0, since last id 1 event happened just 2 weeks ago
#3   1 2021-02-05      1 #Should be 0
#4   2 2020-02-10      2 
#5   3 2020-02-20      3 #Correct, within 3 weeks of another but since the ids are not the same the event is different

Upvotes: 0

Views: 208

Answers (3)

anon01
anon01

Reputation: 11171

It looks like you want the diff between consecutive rows to be three weeks or less, otherwise a new group is formed. You can do it like this, starting from initial time t0:

df = df.sort_values("datefield").reset_index(drop=True)
t0 = df.datefield.iloc[0]
df["delta_t"] = pd.TimedeltaIndex(df.datefield - t0)
df["group"] = (df.delta_t.dt.days.diff() > 21).cumsum()

output:

   id  datefield  delta_t  group
0   2 2020-02-10   0 days      0
1   2 2020-03-20  39 days      1
2   1 2021-01-01 326 days      2
3   1 2021-01-15 340 days      2
4   1 2021-01-30 355 days      2
5   1 2021-02-05 361 days      2

Note that your original dataframe is not sorted properly.

Upvotes: 1

Shubham Periwal
Shubham Periwal

Reputation: 2248

Can compute different columns to make it easily understandable

df

   id   datefield
0   1   2021-01-01
1   1   2021-01-15
2   1   2021-01-30
3   1   2021-02-05
4   2   2020-02-10
5   2   2020-03-20

Calculate difference between dates in number of days

df['diff'] = df['datefield'].diff().dt.days

Get previous ID

df['prevId'] = df['id'].shift()

Decide whether to increment or not

df['increment'] = np.where((df['diff']>21) | (df['prevId'] != df['id']), 1, 0)

Lastly, just get the cumulative sum

df['event'] = df['increment'].cumsum()

Output

   id   datefield   diff    prevId  increment   event
0   1   2021-01-01  NaN      NaN    1   1
1   1   2021-01-15  14.0    1.0     0   1
2   1   2021-01-30  15.0    1.0     0   1
3   1   2021-02-05  6.0     1.0     0   1
4   2   2020-02-10  -361.0  1.0     1   2
5   2   2020-03-20  39.0    2.0     1   3

Upvotes: 2

Henry Ecker
Henry Ecker

Reputation: 35646

Let's try a different approach using a boolean series instead:

df['group'] = ((df['datefield'].diff()
                .fillna(pd.Timedelta(1))
                .gt(pd.Timedelta(weeks=3))) |
               (df['id'].ne(df['id'].shift()))).cumsum()

Output:

   id  datefield  group
0   1 2021-01-01      1
1   1 2021-01-15      1
2   1 2021-01-30      1
3   1 2021-02-05      1
4   2 2020-02-10      2
5   2 2020-03-20      3

Is the difference between the previous row greater than 3 weeks:

print((df['datefield'].diff()
       .fillna(pd.Timedelta(1))
       .gt(pd.Timedelta(weeks=3))))
0    False
1    False
2    False
3    False
4    False
5     True
Name: datefield, dtype: bool

Or is the current id not equal to the previous id:

print((df['id'].ne(df['id'].shift())))
0     True
1    False
2    False
3    False
4     True
5    False
Name: id, dtype: bool

or (|) together the conditions

print((df['datefield'].diff()
       .fillna(pd.Timedelta(1))
       .gt(pd.Timedelta(weeks=3))) |
      (df['id'].ne(df['id'].shift())))
0     True
1    False
2    False
3    False
4     True
5     True
dtype: bool

Then use cumsum to increment every where there is a True value to delimit the groups.


*Assumes id and datafield columns are appropriately ordered.

Upvotes: 1

Related Questions