keynesiancross

Reputation: 3529

Pandas + GroupBy DateTime with time threshold

Is there a way to use groupby with a time threshold? For example, grouping times that are within 1 minute of each other:

    Cents  ED_YF     Vals      EventTimestamp
10  182.5      3  323.473 2022-06-28 13:41:01
11  182.5      3  323.473 2022-06-28 13:41:01
12  153.5      2  384.767 2022-06-28 14:36:15
13  153.5      2  384.767 2022-06-28 14:37:11
14   43.0      2  107.785 2022-06-28 16:40:37
15   43.0      2  107.785 2022-06-28 16:40:37
4   152.0      2  381.007 2022-06-28 16:41:00
5   152.0      2  381.007 2022-06-28 16:41:00

I've seen it used in a merge_asof, where a tolerance threshold is a provided input, but I haven't seen anything similar in the pandas docs for a GroupBy... In this example, I'm looking to add a column ["GroupID"] (via .ngroup()) based on the column EventTimestamp:

    Cents  ED_YF     Vals      EventTimestamp  GroupID
10  182.5      3  323.473 2022-06-28 13:41:01        0
11  182.5      3  323.473 2022-06-28 13:41:01        0
12  153.5      2  384.767 2022-06-28 14:36:15        1
13  153.5      2  384.767 2022-06-28 14:37:11        1
14   43.0      2  107.785 2022-06-28 16:40:37        2
15   43.0      2  107.785 2022-06-28 16:40:37        2
4   152.0      2  381.007 2022-06-28 16:41:00        2
5   152.0      2  381.007 2022-06-28 16:41:00        2

Thanks!

Upvotes: 1

Views: 218

Answers (3)

It_is_Chris

Reputation: 14073

Try using cumsum: flag each row whose gap from the previous row is 1 minute or more, then take the running sum of those flags.

df['GroupID'] = (df['EventTimestamp'].diff() >= pd.Timedelta('1min')).cumsum()

    Cents  ED_YF     Vals      EventTimestamp  GroupID
10  182.5      3  323.473 2022-06-28 13:41:01        0
11  182.5      3  323.473 2022-06-28 13:41:01        0
12  153.5      2  384.767 2022-06-28 14:36:15        1
13  153.5      2  384.767 2022-06-28 14:37:11        1
14   43.0      2  107.785 2022-06-28 16:40:37        2
15   43.0      2  107.785 2022-06-28 16:40:37        2
4   152.0      2  381.007 2022-06-28 16:41:00        2
5   152.0      2  381.007 2022-06-28 16:41:00        2
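For reference, a self-contained run of this approach on the question's sample data (a sketch; the index and dtypes are approximated from the question, and '1min' is the modern spelling of the '1T' offset alias):

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    'Cents': [182.5, 182.5, 153.5, 153.5, 43.0, 43.0, 152.0, 152.0],
    'ED_YF': [3, 3, 2, 2, 2, 2, 2, 2],
    'Vals': [323.473, 323.473, 384.767, 384.767,
             107.785, 107.785, 381.007, 381.007],
    'EventTimestamp': pd.to_datetime([
        '2022-06-28 13:41:01', '2022-06-28 13:41:01',
        '2022-06-28 14:36:15', '2022-06-28 14:37:11',
        '2022-06-28 16:40:37', '2022-06-28 16:40:37',
        '2022-06-28 16:41:00', '2022-06-28 16:41:00',
    ]),
}, index=[10, 11, 12, 13, 14, 15, 4, 5])

# A gap of 1 minute or more starts a new group; cumsum turns the
# boolean "new group starts here" flags into running group IDs.
# (The first diff is NaT, which compares False, so the first row is group 0.)
df['GroupID'] = (df['EventTimestamp'].diff() >= pd.Timedelta('1min')).cumsum()
print(df['GroupID'].tolist())  # [0, 0, 1, 1, 2, 2, 2, 2]
```

Note this assumes the frame is already sorted by EventTimestamp, since diff only compares adjacent rows.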

Upvotes: 2

AlirezaAsadi

Reputation: 803

Here is my answer:

import pandas as pd

df = pd.DataFrame({'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01',
                                      '2022-06-28 14:36:15', '2022-06-28 14:37:11',
                                      '2022-06-28 16:40:37', '2022-06-28 16:40:37',
                                      '2022-06-28 16:41:00', '2022-06-28 16:41:00']})
df['EventTimestamp'] = pd.to_datetime(df['EventTimestamp'])
groups = df['EventTimestamp'].diff().dt.total_seconds().gt(60).cumsum()
df['GroupID'] = groups
print(df)

Output is like:

       EventTimestamp  GroupID
0 2022-06-28 13:41:01        0
1 2022-06-28 13:41:01        0
2 2022-06-28 14:36:15        1
3 2022-06-28 14:37:11        1
4 2022-06-28 16:40:37        2
5 2022-06-28 16:40:37        2
6 2022-06-28 16:41:00        2
7 2022-06-28 16:41:00        2
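One pitfall worth noting with Timedelta accessors (my aside, not part of the original answer): .dt.seconds returns only the seconds *component* of each gap, silently dropping whole days, while .dt.total_seconds() measures the full elapsed time. A small sketch with two events 1 day and 30 seconds apart:

```python
import pandas as pd

# Two events 1 day and 30 seconds apart.
gap = pd.Series(pd.to_datetime(['2022-06-27 13:00:00',
                                '2022-06-28 13:00:30'])).diff()

print(gap.dt.seconds.iloc[1])          # 30.0 -- seconds component only; the day is dropped
print(gap.dt.total_seconds().iloc[1])  # 86430.0 -- the actual elapsed seconds
```

With .dt.seconds, a gap of a day and 30 seconds would be treated as a 30-second gap and the rows would wrongly share a group.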

Upvotes: 1

Zach Flanders

Reputation: 1304

You can use the datetime accessor methods, specifically strftime, to convert the timestamps to the level of specificity you want, and use that as your parameter to the groupby method. Note that this bins rows into fixed calendar minutes rather than grouping by proximity, so 14:36:15 and 14:37:11 end up in different groups:

df = pd.DataFrame({
    'Vals': [323.473, 323.473, 384.767, 384.767, 107.785, 107.785],
    'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01', '2022-06-28 14:36:15', '2022-06-28 14:37:11', '2022-06-28 16:40:37', '2022-06-28 16:40:37'],
})
result = df.assign(GroupId=(
    df
    .groupby(pd.to_datetime(df['EventTimestamp']).dt.strftime('%Y-%m-%d %H:%M'))
    .ngroup()
))

Output:

      Vals       EventTimestamp  GroupId
0  323.473  2022-06-28 13:41:01        0
1  323.473  2022-06-28 13:41:01        0
2  384.767  2022-06-28 14:36:15        1
3  384.767  2022-06-28 14:37:11        2
4  107.785  2022-06-28 16:40:37        3
5  107.785  2022-06-28 16:40:37        3
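To see the difference between calendar-minute binning and gap-based grouping concretely, here is a small sketch (my comparison, not part of the original answer) with two rows only 23 seconds apart that straddle a minute boundary:

```python
import pandas as pd

df = pd.DataFrame({'EventTimestamp': ['2022-06-28 16:40:37',
                                      '2022-06-28 16:41:00']})
ts = pd.to_datetime(df['EventTimestamp'])

# Calendar-minute binning: 23 seconds apart, but different clock minutes,
# so the rows land in different groups.
by_minute = df.groupby(ts.dt.strftime('%Y-%m-%d %H:%M')).ngroup()
print(by_minute.tolist())  # [0, 1]

# Gap-based grouping (diff/cumsum): the gap is under 1 minute,
# so the rows share a group.
by_gap = (ts.diff() >= pd.Timedelta('1min')).cumsum()
print(by_gap.tolist())  # [0, 0]
```

The gap-based approach matches the question's desired output; the strftime approach only works when the clusters happen to fall inside distinct clock minutes.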

Upvotes: 0
