Reputation: 3529
Is there a way to use groupby with a time threshold? For example, grouping rows whose timestamps are within 1 min of each other:
| | Cents | ED_YF | Vals | EventTimestamp |
|---|---|---|---|---|
| 10 | 182.5 | 3 | 323.473 | 2022-06-28 13:41:01 |
| 11 | 182.5 | 3 | 323.473 | 2022-06-28 13:41:01 |
| 12 | 153.5 | 2 | 384.767 | 2022-06-28 14:36:15 |
| 13 | 153.5 | 2 | 384.767 | 2022-06-28 14:37:11 |
| 14 | 43 | 2 | 107.785 | 2022-06-28 16:40:37 |
| 15 | 43 | 2 | 107.785 | 2022-06-28 16:40:37 |
| 4 | 152 | 2 | 381.007 | 2022-06-28 16:41:00 |
| 5 | 152 | 2 | 381.007 | 2022-06-28 16:41:00 |
I've seen this done with merge_asof, where a tolerance is a provided input, but I haven't seen anything similar in the pandas docs for GroupBy. In this example, I'm looking to add a column ["GroupID"] (via .ngroup()) based on the EventTimestamp column:
| | Cents | ED_YF | Vals | EventTimestamp | GroupID |
|---|---|---|---|---|---|
| 10 | 182.5 | 3 | 323.473 | 2022-06-28 13:41:01 | 0 |
| 11 | 182.5 | 3 | 323.473 | 2022-06-28 13:41:01 | 0 |
| 12 | 153.5 | 2 | 384.767 | 2022-06-28 14:36:15 | 1 |
| 13 | 153.5 | 2 | 384.767 | 2022-06-28 14:37:11 | 1 |
| 14 | 43 | 2 | 107.785 | 2022-06-28 16:40:37 | 2 |
| 15 | 43 | 2 | 107.785 | 2022-06-28 16:40:37 | 2 |
| 4 | 152 | 2 | 381.007 | 2022-06-28 16:41:00 | 2 |
| 5 | 152 | 2 | 381.007 | 2022-06-28 16:41:00 | 2 |
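For anyone reproducing this, the input frame can be built as follows (values copied from the tables above):
import pandas as pd

df = pd.DataFrame(
    {
        'Cents': [182.5, 182.5, 153.5, 153.5, 43, 43, 152, 152],
        'ED_YF': [3, 3, 2, 2, 2, 2, 2, 2],
        'Vals': [323.473, 323.473, 384.767, 384.767,
                 107.785, 107.785, 381.007, 381.007],
        'EventTimestamp': pd.to_datetime([
            '2022-06-28 13:41:01', '2022-06-28 13:41:01',
            '2022-06-28 14:36:15', '2022-06-28 14:37:11',
            '2022-06-28 16:40:37', '2022-06-28 16:40:37',
            '2022-06-28 16:41:00', '2022-06-28 16:41:00',
        ]),
    },
    index=[10, 11, 12, 13, 14, 15, 4, 5],
)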
Thanks!
Upvotes: 1
Views: 218
Reputation: 14073
Try using diff with cumsum: flag every row whose gap from the previous timestamp is at least one minute, then take the running sum of the flags (this assumes the frame is sorted by EventTimestamp):
df['GroupID'] = (df['EventTimestamp'].diff() >= pd.Timedelta('1min')).cumsum()
Cents ED_YF Vals EventTimestamp GroupID
10 182.5 3 323.473 2022-06-28 13:41:01 0
11 182.5 3 323.473 2022-06-28 13:41:01 0
12 153.5 2 384.767 2022-06-28 14:36:15 1
13 153.5 2 384.767 2022-06-28 14:37:11 1
14 43.0 2 107.785 2022-06-28 16:40:37 2
15 43.0 2 107.785 2022-06-28 16:40:37 2
4 152.0 2 381.007 2022-06-28 16:41:00 2
5 152.0 2 381.007 2022-06-28 16:41:00 2
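The resulting GroupID column then works as an ordinary grouping key. As a sketch (the aggregation names first_event and rows are just illustrative, not from the question):
summary = df.groupby('GroupID').agg(
    first_event=('EventTimestamp', 'min'),  # start of each burst of events
    rows=('EventTimestamp', 'size'),        # number of rows in the burst
)
print(summary)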
Upvotes: 2
Reputation: 803
Here is my answer:
import pandas as pd

df = pd.DataFrame({'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01',
                                      '2022-06-28 14:36:15', '2022-06-28 14:37:11',
                                      '2022-06-28 16:40:37', '2022-06-28 16:40:37',
                                      '2022-06-28 16:41:00', '2022-06-28 16:41:00']})
df['EventTimestamp'] = pd.to_datetime(df['EventTimestamp'])

# A gap of more than 60 seconds starts a new group. total_seconds() is used
# rather than .dt.seconds, which returns only the seconds component of the
# timedelta and would treat a gap of exactly one day as zero seconds.
groups = df['EventTimestamp'].diff().dt.total_seconds().gt(60).cumsum()
df['GroupID'] = groups
print(df)
Output:
EventTimestamp GroupID
0 2022-06-28 13:41:01 0
1 2022-06-28 13:41:01 0
2 2022-06-28 14:36:15 1
3 2022-06-28 14:37:11 1
4 2022-06-28 16:40:37 2
5 2022-06-28 16:40:37 2
6 2022-06-28 16:41:00 2
7 2022-06-28 16:41:00 2
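To see why total_seconds() matters here: .dt.seconds returns only the seconds component of each timedelta, so large gaps can be misread. A quick illustration with a made-up one-day gap:
gap = pd.Series([pd.Timedelta(days=1, seconds=5)])
print(gap.dt.seconds.iloc[0])          # 5 -- seconds component only
print(gap.dt.total_seconds().iloc[0])  # 86405.0 -- the full gap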
Upvotes: 1
Reputation: 1304
You can use the datetime accessor methods, specifically strftime, to truncate the timestamps to the level of specificity you want and use the result as the key for groupby.
df = pd.DataFrame({
'Vals': [323.473, 323.473, 384.767, 384.767, 107.785, 107.785],
'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01', '2022-06-28 14:36:15', '2022-06-28 14:37:11', '2022-06-28 16:40:37', '2022-06-28 16:40:37'],
})
result = df.assign(GroupId=(
df
    .groupby(pd.to_datetime(df['EventTimestamp']).dt.strftime('%Y-%m-%d %H:%M'))  # truncate to the minute
.ngroup()
))
Output:
      Vals       EventTimestamp  GroupId
0  323.473  2022-06-28 13:41:01        0
1  323.473  2022-06-28 13:41:01        0
2  384.767  2022-06-28 14:36:15        1
3  384.767  2022-06-28 14:37:11        2
4  107.785  2022-06-28 16:40:37        3
5  107.785  2022-06-28 16:40:37        3
Note that this truncates to the calendar minute rather than comparing gaps, so 14:36:15 and 14:37:11 fall into different groups even though they are only 56 seconds apart. If you need the gap-based grouping shown in the question, use one of the diff-based answers above.
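If the string formatting feels roundabout, dt.floor expresses the same minute truncation directly; a sketch of the equivalent call (same calendar-minute semantics, same caveat as above):
result = df.assign(GroupId=(
    df
    .groupby(pd.to_datetime(df['EventTimestamp']).dt.floor('min'))
    .ngroup()
))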
Upvotes: 0