Reputation: 3529
Is there a way to use groupby with a time threshold? For example, grouping rows whose timestamps are within 1 min of each other:
| | Cents | ED_YF | Vals | EventTimestamp |
|---|---|---|---|---|
| 10 | 182.5 | 3 | 323.473 | 2022-06-28 13:41:01 |
| 11 | 182.5 | 3 | 323.473 | 2022-06-28 13:41:01 |
| 12 | 153.5 | 2 | 384.767 | 2022-06-28 14:36:15 |
| 13 | 153.5 | 2 | 384.767 | 2022-06-28 14:37:11 |
| 14 | 43 | 2 | 107.785 | 2022-06-28 16:40:37 |
| 15 | 43 | 2 | 107.785 | 2022-06-28 16:40:37 |
| 4 | 152 | 2 | 381.007 | 2022-06-28 16:41:00 |
| 5 | 152 | 2 | 381.007 | 2022-06-28 16:41:00 |
I've seen this done with merge_asof, where a tolerance is a provided input, but I haven't seen anything similar in the pandas docs for GroupBy. In this example, I'm looking to add a column ["GroupID"] (via .ngroup()) based on the EventTimestamp column:
| | Cents | ED_YF | Vals | EventTimestamp | GroupID |
|---|---|---|---|---|---|
| 10 | 182.5 | 3 | 323.473 | 2022-06-28 13:41:01 | 0 |
| 11 | 182.5 | 3 | 323.473 | 2022-06-28 13:41:01 | 0 |
| 12 | 153.5 | 2 | 384.767 | 2022-06-28 14:36:15 | 1 |
| 13 | 153.5 | 2 | 384.767 | 2022-06-28 14:37:11 | 1 |
| 14 | 43 | 2 | 107.785 | 2022-06-28 16:40:37 | 2 |
| 15 | 43 | 2 | 107.785 | 2022-06-28 16:40:37 | 2 |
| 4 | 152 | 2 | 381.007 | 2022-06-28 16:41:00 | 2 |
| 5 | 152 | 2 | 381.007 | 2022-06-28 16:41:00 | 2 |
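For anyone reproducing this, the input frame can be built as follows (values copied from the tables above):
import pandas as pd

df = pd.DataFrame(
    {
        'Cents': [182.5, 182.5, 153.5, 153.5, 43, 43, 152, 152],
        'ED_YF': [3, 3, 2, 2, 2, 2, 2, 2],
        'Vals': [323.473, 323.473, 384.767, 384.767,
                 107.785, 107.785, 381.007, 381.007],
        'EventTimestamp': pd.to_datetime([
            '2022-06-28 13:41:01', '2022-06-28 13:41:01',
            '2022-06-28 14:36:15', '2022-06-28 14:37:11',
            '2022-06-28 16:40:37', '2022-06-28 16:40:37',
            '2022-06-28 16:41:00', '2022-06-28 16:41:00',
        ]),
    },
    index=[10, 11, 12, 13, 14, 15, 4, 5],
)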
Thanks!
Upvotes: 1
Views: 218
Reputation: 14073
Try using diff with cumsum: flag every row whose gap from the previous timestamp is at least one minute, then take the running sum of the flags (this assumes the frame is sorted by EventTimestamp):
df['GroupID'] = (df['EventTimestamp'].diff() >= pd.Timedelta('1min')).cumsum()
Cents ED_YF Vals EventTimestamp GroupID
10 182.5 3 323.473 2022-06-28 13:41:01 0
11 182.5 3 323.473 2022-06-28 13:41:01 0
12 153.5 2 384.767 2022-06-28 14:36:15 1
13 153.5 2 384.767 2022-06-28 14:37:11 1
14 43.0 2 107.785 2022-06-28 16:40:37 2
15 43.0 2 107.785 2022-06-28 16:40:37 2
4 152.0 2 381.007 2022-06-28 16:41:00 2
5 152.0 2 381.007 2022-06-28 16:41:00 2
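The resulting GroupID column then works as an ordinary grouping key. As a sketch (the aggregation names first_event and rows are just illustrative, not from the question):
summary = df.groupby('GroupID').agg(
    first_event=('EventTimestamp', 'min'),  # start of each burst of events
    rows=('EventTimestamp', 'size'),        # number of rows in the burst
)
print(summary)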
Upvotes: 2
Reputation: 803
Here is my answer:
import pandas as pd

df = pd.DataFrame({'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01',
                                      '2022-06-28 14:36:15', '2022-06-28 14:37:11',
                                      '2022-06-28 16:40:37', '2022-06-28 16:40:37',
                                      '2022-06-28 16:41:00', '2022-06-28 16:41:00']})
df['EventTimestamp'] = pd.to_datetime(df['EventTimestamp'])

# A gap of more than 60 seconds starts a new group. total_seconds() is used
# rather than .dt.seconds, which returns only the seconds component of the
# timedelta and would treat a gap of exactly one day as zero seconds.
groups = df['EventTimestamp'].diff().dt.total_seconds().gt(60).cumsum()
df['GroupID'] = groups
print(df)
Output:
EventTimestamp GroupID
0 2022-06-28 13:41:01 0
1 2022-06-28 13:41:01 0
2 2022-06-28 14:36:15 1
3 2022-06-28 14:37:11 1
4 2022-06-28 16:40:37 2
5 2022-06-28 16:40:37 2
6 2022-06-28 16:41:00 2
7 2022-06-28 16:41:00 2
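To see why total_seconds() matters here: .dt.seconds returns only the seconds component of each timedelta, so large gaps can be misread. A quick illustration with a made-up one-day gap:
gap = pd.Series([pd.Timedelta(days=1, seconds=5)])
print(gap.dt.seconds.iloc[0])          # 5 -- seconds component only
print(gap.dt.total_seconds().iloc[0])  # 86405.0 -- the full gap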
Upvotes: 1
Reputation: 1304
You can use the datetime accessor methods, specifically strftime, to truncate the timestamps to the level of specificity you want and use the result as the key for groupby.
df = pd.DataFrame({
'Vals': [323.473, 323.473, 384.767, 384.767, 107.785, 107.785],
'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01', '2022-06-28 14:36:15', '2022-06-28 14:37:11', '2022-06-28 16:40:37', '2022-06-28 16:40:37'],
})
result = df.assign(GroupId=(
df
    .groupby(pd.to_datetime(df['EventTimestamp']).dt.strftime('%Y-%m-%d %H:%M'))  # truncate to the minute
.ngroup()
))
Output:
      Vals       EventTimestamp  GroupId
0  323.473  2022-06-28 13:41:01        0
1  323.473  2022-06-28 13:41:01        0
2  384.767  2022-06-28 14:36:15        1
3  384.767  2022-06-28 14:37:11        2
4  107.785  2022-06-28 16:40:37        3
5  107.785  2022-06-28 16:40:37        3
Note that this truncates to the calendar minute rather than comparing gaps, so 14:36:15 and 14:37:11 fall into different groups even though they are only 56 seconds apart. If you need the gap-based grouping shown in the question, use one of the diff-based answers above.
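If the string formatting feels roundabout, dt.floor expresses the same minute truncation directly; a sketch of the equivalent call (same calendar-minute semantics, same caveat as above):
result = df.assign(GroupId=(
    df
    .groupby(pd.to_datetime(df['EventTimestamp']).dt.floor('min'))
    .ngroup()
))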
Upvotes: 0