Reputation: 1949
I want to group nearby dates together, using a rolling window (?) of three week periods.
See example and attempt below:
import pandas as pd
d = {'id':[1, 1, 1, 1, 2, 3],
'datefield':['2021-01-01', '2021-01-15', '2021-01-30', '2021-02-05', '2020-02-10', '2020-02-20']}
df = pd.DataFrame(data=d)
df['datefield'] = pd.to_datetime(df['datefield'])
# id datefield
#0 1 2021-01-01
#1 1 2021-01-15
#2 1 2021-02-01
#3 2 2020-02-10
#4 3 2020-02-20
df['event'] = df.groupby(['id', pd.Grouper(key='datefield', freq='3W')]).ngroup()
# id datefield event
#0 1 2021-01-01 0
#1 1 2021-01-15 0
#2 1 2021-01-30 1 #Should be 0, since last id 1 event happened just 2 weeks ago
#3 1 2021-02-05 1 #Should be 0
#4 2 2020-02-10 2
#5 3 2020-02-20 3 #Correct, within 3 weeks of another but since the ids are not the same the event is different
Upvotes: 0
Views: 208
Reputation: 11171
It looks like you want the diff between consecutive rows to be three weeks or less, otherwise a new group is formed. You can do it like this, starting from initial time t0
:
df = df.sort_values("datefield").reset_index(drop=True)
t0 = df.datefield.iloc[0]
df["delta_t"] = pd.TimedeltaIndex(df.datefield - t0)
df["group"] = (df.delta_t.dt.days.diff() > 21).cumsum()
output:
id datefield delta_t group
0 2 2020-02-10 0 days 0
1 2 2020-03-20 39 days 1
2 1 2021-01-01 326 days 2
3 1 2021-01-15 340 days 2
4 1 2021-01-30 355 days 2
5 1 2021-02-05 361 days 2
Note that your original dataframe is not sorted properly.
Upvotes: 1
Reputation: 2248
Can compute different columns to make it easily understandable
df
id datefield
0 1 2021-01-01
1 1 2021-01-15
2 1 2021-01-30
3 1 2021-02-05
4 2 2020-02-10
5 2 2020-03-20
Calculate difference between dates in number of days
df['diff'] = df['datefield'].diff().dt.days
Get previous ID
df['prevId'] = df['id'].shift()
Decide whether to increment or not
df['increment'] = np.where((df['diff']>21) | (df['prevId'] != df['id']), 1, 0)
Lastly, just get the cumulative sum
df['event'] = df['increment'].cumsum()
Output
id datefield diff prevId increment event
0 1 2021-01-01 NaN NaN 1 1
1 1 2021-01-15 14.0 1.0 0 1
2 1 2021-01-30 15.0 1.0 0 1
3 1 2021-02-05 6.0 1.0 0 1
4 2 2020-02-10 -361.0 1.0 1 2
5 2 2020-03-20 39.0 2.0 1 3
Upvotes: 2
Reputation: 35646
Let's try a different approach using a boolean series instead:
df['group'] = ((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift()))).cumsum()
Output:
id datefield group 0 1 2021-01-01 1 1 1 2021-01-15 1 2 1 2021-01-30 1 3 1 2021-02-05 1 4 2 2020-02-10 2 5 2 2020-03-20 3
Is the difference between the previous row greater than 3 weeks:
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))))
0 False
1 False
2 False
3 False
4 False
5 True
Name: datefield, dtype: bool
Or is the current id
not equal to the previous id
:
print((df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 False
Name: id, dtype: bool
or
(|
) together the conditions
print((df['datefield'].diff()
.fillna(pd.Timedelta(1))
.gt(pd.Timedelta(weeks=3))) |
(df['id'].ne(df['id'].shift())))
0 True
1 False
2 False
3 False
4 True
5 True
dtype: bool
Then use cumsum to increment every where there is a True
value to delimit the groups.
*Assumes id
and datafield
columns are appropriately ordered.
Upvotes: 1