Reputation: 1369
Let's say I have a Pandas dataframe where the index is a datetime value. I want to add a column that holds the total_seconds elapsed between each record and the next one.
Problem setup:
import pandas as pd
df = pd.DataFrame(
    data=[
        ["2021-02-24 20:53:14.572000+00:00", "2362"],
        ["2021-02-24 21:02:28.567000+00:00", "4264"],
        ["2021-02-24 21:02:29.572000+00:00", "5160"],
        ["2021-02-24 21:02:30.561000+00:00", "6183"],
        ["2021-02-24 21:03:55.606000+00:00", "9654"],
    ],
    columns=["event_time", "some_metric"],
)
# Make the timestamp our index and make sure the events are in order.
df["event_time"] = pd.to_datetime(df["event_time"])
df = df.set_index("event_time")
df = df.sort_index()
So now my dataframe looks like this:
                                 some_metric
event_time
2021-02-24 20:53:14.572000+00:00        2362
2021-02-24 21:02:28.567000+00:00        4264
2021-02-24 21:02:29.572000+00:00        5160
2021-02-24 21:02:30.561000+00:00        6183
2021-02-24 21:03:55.606000+00:00        9654
Now I want to add a new column that is the seconds elapsed until the next event.
Here's what I'm trying, which runs without error:
df["seconds_until_next"] = (
    df.reset_index()["event_time"].shift(-1) - df.reset_index()["event_time"]
).dt.total_seconds()
But the resulting dataframe looks like this, with all NaNs in the new column:
                                 some_metric  seconds_until_next
event_time
2021-02-24 20:53:14.572000+00:00        2362                 NaN
2021-02-24 21:02:28.567000+00:00        4264                 NaN
2021-02-24 21:02:29.572000+00:00        5160                 NaN
2021-02-24 21:02:30.561000+00:00        6183                 NaN
2021-02-24 21:03:55.606000+00:00        9654                 NaN
Which is weird, because just running the right side of that operation looks like it returns the values I want:
(df.reset_index()["event_time"].shift(-1) - df.reset_index()["event_time"]).dt.total_seconds()
returns:
0    553.995
1      1.005
2      0.989
3     85.045
4        NaN
Name: event_time, dtype: float64
What's going on here? I assume it's because of a mismatch in index values between the df on the left side of the equals and the result on the right side? How do I resolve that?
Update:
These answers are great, wish I could give everyone the points. I feel like the magic piece of knowledge I was missing was .to_series(). Will definitely keep the other suggestions in mind. This way works and feels good in my brain, not sure if it's the most performant:
df["seconds_until_next"] = (
    df.index.to_series().shift(-1) - df.index.to_series()
).dt.total_seconds()
Upvotes: 1
Views: 52
Reputation: 31146
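There is a subtle issue: the Series on the right-hand side has a RangeIndex, which does not match the DataFrame's datetime index, so the assignment aligns on labels that never match. I worked around it by turning the result into a plain array with .values: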
df.assign(seconds_until_next=(pd.Series(df.index).shift(-1) - df.index).dt.total_seconds().values)
event_time | some_metric | seconds_until_next |
---|---|---|
2021-02-24 20:53:14.572000+00:00 | 2362 | 553.995 |
2021-02-24 21:02:28.567000+00:00 | 4264 | 1.005 |
2021-02-24 21:02:29.572000+00:00 | 5160 | 0.989 |
2021-02-24 21:02:30.561000+00:00 | 6183 | 85.045 |
2021-02-24 21:03:55.606000+00:00 | 9654 | nan |
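To make the alignment issue concrete, here is a minimal sketch on a three-row version of the question's data. The column names `aligned` and `positional` are made up for illustration, and `.to_numpy()` is used in place of `.values` (both return the underlying array here).

```python
import pandas as pd

idx = pd.to_datetime(
    [
        "2021-02-24 20:53:14.572000+00:00",
        "2021-02-24 21:02:28.567000+00:00",
        "2021-02-24 21:02:29.572000+00:00",
    ]
)
df = pd.DataFrame({"some_metric": ["2362", "4264", "5160"]}, index=idx)
df.index.name = "event_time"

# A Series built from the index gets a fresh RangeIndex (0, 1, 2).
times = pd.Series(df.index)
gaps = (times.shift(-1) - times).dt.total_seconds()

# Assigning a Series aligns on index labels. The labels 0, 1, 2 never
# match the timestamps in df.index, so every row ends up NaN.
df["aligned"] = gaps

# Assigning a plain array is positional, so values land in row order.
df["positional"] = gaps.to_numpy()

print(df)
```

The `aligned` column comes out all NaN, while `positional` holds the gaps in row order with NaN only in the last row.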
Upvotes: 0
Reputation: 9941
A similar (but slightly more concise) option with diff (instead of shifting and subtracting):
df['seconds_until_next'] = -df.index.to_series().diff(-1).dt.total_seconds()
df
Output:
                                 some_metric  seconds_until_next
event_time
2021-02-24 20:53:14.572000+00:00        2362             553.995
2021-02-24 21:02:28.567000+00:00        4264               1.005
2021-02-24 21:02:29.572000+00:00        5160               0.989
2021-02-24 21:02:30.561000+00:00        6183              85.045
2021-02-24 21:03:55.606000+00:00        9654                 NaN
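The leading minus sign is needed because diff(-1) computes each value minus the next one, which is the opposite sign of "next minus current". A small sketch to check that the two spellings agree, using a standalone Series named `times` (a name chosen just for this example):

```python
import pandas as pd

times = pd.Series(
    pd.to_datetime(
        [
            "2021-02-24 20:53:14.572000+00:00",
            "2021-02-24 21:02:28.567000+00:00",
            "2021-02-24 21:02:29.572000+00:00",
        ]
    )
)

# Shift-and-subtract: next timestamp minus current timestamp.
via_shift = (times.shift(-1) - times).dt.total_seconds()

# diff(-1) is current minus next, so negate it to get the same sign.
via_diff = -times.diff(-1).dt.total_seconds()

# Raises AssertionError if the two results differ.
pd.testing.assert_series_equal(via_shift, via_diff)
```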
Upvotes: 2
Reputation: 150735
That's because when you do reset_index, you get a different index (RangeIndex) which doesn't align with the index of the original df, so you get all NaN values. Try to_series:
df['time_gap'] = df.index.to_series().shift(-1).sub(df.index).dt.total_seconds()
Output:
                                 some_metric  time_gap
event_time
2021-02-24 20:53:14.572000+00:00        2362   553.995
2021-02-24 21:02:28.567000+00:00        4264     1.005
2021-02-24 21:02:29.572000+00:00        5160     0.989
2021-02-24 21:02:30.561000+00:00        6183    85.045
2021-02-24 21:03:55.606000+00:00        9654       NaN
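Since alignment is the root cause, another way to sidestep it is to work on the index positionally and never build an intermediate Series. This is only a sketch, not something suggested in the thread: it assumes a sorted index with at least one row, and I have not measured whether it is any faster than to_series.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(
    [
        "2021-02-24 20:53:14.572000+00:00",
        "2021-02-24 21:02:28.567000+00:00",
        "2021-02-24 21:02:29.572000+00:00",
    ]
)
df = pd.DataFrame({"some_metric": ["2362", "4264", "5160"]}, index=idx)

# Subtract each timestamp from its successor by slicing the index:
# index[1:] holds the "next" values, index[:-1] the "current" ones.
gaps = (df.index[1:] - df.index[:-1]).total_seconds()

# The last row has no successor, so pad with NaN to match len(df).
# (Assumes df is non-empty; an empty frame would get a length mismatch.)
df["seconds_until_next"] = np.append(gaps, np.nan)

print(df)
```

Because the assigned value is a NumPy array, pandas places it by position and no index alignment happens.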
Upvotes: 1