sql_knievel

Reputation: 1369

Assign seconds elapsed between subsequent index values to new column

Let's say I have a Pandas dataframe where the index is a datetime value. I want to add a column holding the number of seconds elapsed between each record and the next one.

Problem setup:

import pandas as pd

df = pd.DataFrame(
    data=[
        ["2021-02-24 20:53:14.572000+00:00", "2362"],
        ["2021-02-24 21:02:28.567000+00:00", "4264"],
        ["2021-02-24 21:02:29.572000+00:00", "5160"],
        ["2021-02-24 21:02:30.561000+00:00", "6183"],
        ["2021-02-24 21:03:55.606000+00:00", "9654"],
    ],
    columns=["event_time", "some_metric"],
)

# Make the timestamp our index and make sure the events are in order.
df["event_time"] = pd.to_datetime(df["event_time"])
df = df.set_index("event_time")
df = df.sort_index()

So now my dataframe looks like this:

                                 some_metric
                      event_time    
2021-02-24 20:53:14.572000+00:00    2362
2021-02-24 21:02:28.567000+00:00    4264
2021-02-24 21:02:29.572000+00:00    5160
2021-02-24 21:02:30.561000+00:00    6183
2021-02-24 21:03:55.606000+00:00    9654

Now I want to add a new column that is the seconds elapsed until the next event.

Here's what I'm trying, which runs without error:

df["seconds_until_next"] = (
    df.reset_index()["event_time"].shift(-1) - df.reset_index()["event_time"]
).dt.total_seconds()

But the resulting dataframe looks like this, with all NaNs in the new column:

                                 some_metric    seconds_until_next
                      event_time
2021-02-24 20:53:14.572000+00:00    2362              NaN
2021-02-24 21:02:28.567000+00:00    4264              NaN
2021-02-24 21:02:29.572000+00:00    5160              NaN
2021-02-24 21:02:30.561000+00:00    6183              NaN
2021-02-24 21:03:55.606000+00:00    9654              NaN

Which is weird, because just running the right side of that operation looks like it returns the values I want:

(df.reset_index()["event_time"].shift(-1) - df.reset_index()["event_time"]).dt.total_seconds()

returns:

0    553.995
1      1.005
2      0.989
3     85.045
4        NaN
Name: event_time, dtype: float64

What's going on here? I assume it's because of a mismatch in index values between the df on the left side of the equals and the result on the right side? How do I resolve that?

Update: These answers are great; I wish I could give everyone the points. The piece of knowledge I was missing was .to_series(). I will definitely keep the other suggestions in mind. This version works and makes sense to me, though I'm not sure it's the most performant:

df["seconds_until_next"] = (
    df.index.to_series().shift(-1) - df.index.to_series()
).dt.total_seconds()

Upvotes: 1

Views: 52

Answers (3)

Rob Raymond

Reputation: 31146

There is a subtle issue: the computed Series has a RangeIndex, which does not match the DataFrame's DatetimeIndex. A workaround is to convert it to an array with .values:

df.assign(seconds_until_next=(pd.Series(df.index).shift(-1) - df.index).dt.total_seconds().values)

                                 some_metric  seconds_until_next
event_time
2021-02-24 20:53:14.572000+00:00        2362             553.995
2021-02-24 21:02:28.567000+00:00        4264               1.005
2021-02-24 21:02:29.572000+00:00        5160               0.989
2021-02-24 21:02:30.561000+00:00        6183              85.045
2021-02-24 21:03:55.606000+00:00        9654                 NaN
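
As a rough sketch of why stripping the index helps: once the result is a plain NumPy array, it has no labels left to align on, so pandas assigns it row by row. The three-row frame below is a cut-down copy of the question's data, and .to_numpy() is used here as a stand-in for .values (for this float Series both should hand back the same array).

```python
import pandas as pd

# Cut-down copy of the question's data (first three events only).
df = pd.DataFrame(
    {"some_metric": ["2362", "4264", "5160"]},
    index=pd.to_datetime(
        [
            "2021-02-24 20:53:14.572000+00:00",
            "2021-02-24 21:02:28.567000+00:00",
            "2021-02-24 21:02:29.572000+00:00",
        ]
    ),
)
df.index.name = "event_time"

# A Series built from the index gets a fresh RangeIndex (0, 1, 2).
stamps = pd.Series(df.index)
gaps = (stamps.shift(-1) - stamps).dt.total_seconds()

# A NumPy array carries no labels, so the assignment is purely positional.
df["seconds_until_next"] = gaps.to_numpy()
print(df)
```

This relies on the array being in the same row order as df, which holds here because it was computed from df.index itself.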

Upvotes: 0

perl

Reputation: 9941

A similar (but slightly more concise) option with diff (instead of shifting and subtracting):

df['seconds_until_next'] = -df.index.to_series().diff(-1).dt.total_seconds()

df

Output:

                                 some_metric  seconds_until_next
event_time                                                      
2021-02-24 20:53:14.572000+00:00        2362             553.995
2021-02-24 21:02:28.567000+00:00        4264               1.005
2021-02-24 21:02:29.572000+00:00        5160               0.989
2021-02-24 21:02:30.561000+00:00        6183              85.045
2021-02-24 21:03:55.606000+00:00        9654                 NaN
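
In case the sign flip looks odd: diff(-1) computes each value minus the following one, so negating it should give the same "next minus current" gap as the shift-and-subtract version. A small check on a standalone Series (a three-row subset of the question's timestamps, not the full frame):

```python
import pandas as pd

# Three of the question's timestamps as a standalone Series.
stamps = pd.Series(
    pd.to_datetime(
        [
            "2021-02-24 20:53:14.572000+00:00",
            "2021-02-24 21:02:28.567000+00:00",
            "2021-02-24 21:02:29.572000+00:00",
        ]
    )
)

# Next value minus current value, written out explicitly.
via_shift = (stamps.shift(-1) - stamps).dt.total_seconds()

# diff(-1) is current minus next, so negate it to get the same direction.
via_diff = -stamps.diff(-1).dt.total_seconds()

print(pd.DataFrame({"via_shift": via_shift, "via_diff": via_diff}))
```

Both columns should agree, with NaN in the last row because there is no following event.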

Upvotes: 2

Quang Hoang

Reputation: 150735

That's because reset_index gives you a different index (a RangeIndex), which doesn't align with the original df's index, so you get all NaN values. Try to_series instead:

df['time_gap'] = df.index.to_series().shift(-1).sub(df.index).dt.total_seconds()

Output:

                                 some_metric  time_gap
event_time                                            
2021-02-24 20:53:14.572000+00:00        2362   553.995
2021-02-24 21:02:28.567000+00:00        4264     1.005
2021-02-24 21:02:29.572000+00:00        5160     0.989
2021-02-24 21:02:30.561000+00:00        6183    85.045
2021-02-24 21:03:55.606000+00:00        9654       NaN
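
To see the alignment effect side by side, here is a minimal sketch on a three-row subset of the question's data. The column names misaligned and aligned are made up for the illustration.

```python
import pandas as pd

# Three-row subset of the question's data.
df = pd.DataFrame(
    {"some_metric": ["2362", "4264", "5160"]},
    index=pd.to_datetime(
        [
            "2021-02-24 20:53:14.572000+00:00",
            "2021-02-24 21:02:28.567000+00:00",
            "2021-02-24 21:02:29.572000+00:00",
        ]
    ),
)
df.index.name = "event_time"

# reset_index() yields a Series labelled 0, 1, 2 (a RangeIndex).
ranged = df.reset_index()["event_time"]
gaps_ranged = (ranged.shift(-1) - ranged).dt.total_seconds()

# Column assignment aligns on labels; 0, 1, 2 are not in the DatetimeIndex,
# so every row ends up NaN.
df["misaligned"] = gaps_ranged

# to_series() keeps the timestamps as labels, so the rows line up.
stamps = df.index.to_series()
df["aligned"] = (stamps.shift(-1) - stamps).dt.total_seconds()

print(df)
```

The values computed in both cases are the same; only the labels attached to them differ, and that is what decides where they land.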

Upvotes: 1
