Reputation: 1496
What's the proper way to shift this time series, and re-align the data to the same index? E.g. How would I generate the data frame with the same index values as "data," but where the value at each point was the last value seen as of 0.4 seconds after the index timestamp?
I'd expect this to be a rather common operation among people dealing with irregular and mixed frequency time series ("what's the last value as of an arbitrary time offset to my current time?"), so I would expect (hope for?) this functionality to exist...
Suppose I have the following data frame:
>>> import pandas as pd
>>> import numpy as np
>>> import time
>>>
>>> x = np.arange(10)
>>> #t = time.time() + x + np.random.randn(10)
... t = np.array([1467421851418745856, 1467421852687532544, 1467421853288187136,
... 1467421854838806528, 1467421855148979456, 1467421856415879424,
... 1467421857259467264, 1467421858375025408, 1467421859019387904,
... 1467421860235784448])
>>> data = pd.DataFrame({"x": x})
>>> data.index = pd.to_datetime(t)
>>> data["orig_time"] = data.index
>>> data
x orig_time
2016-07-02 01:10:51.418745856 0 2016-07-02 01:10:51.418745856
2016-07-02 01:10:52.687532544 1 2016-07-02 01:10:52.687532544
2016-07-02 01:10:53.288187136 2 2016-07-02 01:10:53.288187136
2016-07-02 01:10:54.838806528 3 2016-07-02 01:10:54.838806528
2016-07-02 01:10:55.148979456 4 2016-07-02 01:10:55.148979456
2016-07-02 01:10:56.415879424 5 2016-07-02 01:10:56.415879424
2016-07-02 01:10:57.259467264 6 2016-07-02 01:10:57.259467264
2016-07-02 01:10:58.375025408 7 2016-07-02 01:10:58.375025408
2016-07-02 01:10:59.019387904 8 2016-07-02 01:10:59.019387904
2016-07-02 01:11:00.235784448 9 2016-07-02 01:11:00.235784448
I can write the following function:
def time_shift(df, delta):
"""Shift a DataFrame object such that each row contains the last known
value as of the time `df.index + delta`."""
lookup_index = df.index + delta
mapped_indicies = np.searchsorted(df.index, lookup_index, side='left')
# Clamp bounds to allow us to index into the original DataFrame
cleaned_indicies = np.clip(mapped_indicies, 0,
len(mapped_indicies) - 1)
# Since searchsorted gives us an insertion point, we'll generally
# have to shift back by one to get the last value prior to the
# insertion point. I choose to keep contemporaneous values,
# rather than looking back one, but that's a matter of personal
# preference.
lookback = np.where(lookup_index < df.index[cleaned_indicies], 1, 0)
# And remember to re-clip to avoid index errors...
cleaned_indicies = np.clip(cleaned_indicies - lookback, 0,
len(mapped_indicies) - 1)
new_df = df.iloc[cleaned_indicies]
# We don't know what the value was before the beginning...
new_df.iloc[lookup_index < df.index[0]] = np.NaN
# We don't know what the value was after the end...
new_df.iloc[mapped_indicies >= len(mapped_indicies)] = np.NaN
new_df.index = df.index
return new_df
with the desired behavior:
>>> time_shift(data, pd.Timedelta('0.4s'))
x orig_time
2016-07-02 01:10:51.418745856 0.0 2016-07-02 01:10:51.418745856
2016-07-02 01:10:52.687532544 1.0 2016-07-02 01:10:52.687532544
2016-07-02 01:10:53.288187136 2.0 2016-07-02 01:10:53.288187136
2016-07-02 01:10:54.838806528 4.0 2016-07-02 01:10:55.148979456
2016-07-02 01:10:55.148979456 4.0 2016-07-02 01:10:55.148979456
2016-07-02 01:10:56.415879424 5.0 2016-07-02 01:10:56.415879424
2016-07-02 01:10:57.259467264 6.0 2016-07-02 01:10:57.259467264
2016-07-02 01:10:58.375025408 7.0 2016-07-02 01:10:58.375025408
2016-07-02 01:10:59.019387904 8.0 2016-07-02 01:10:59.019387904
2016-07-02 01:11:00.235784448 NaN NaT
As you can see, getting this calculation right is a bit tricky, so I'd much prefer a supported implementation vs. 'rolling my own'.
This doesn't work. It shifts truncates the first argument and shifts all rows by 0 positions:
>>> data.shift(0.4)
x orig_time
2016-07-02 01:10:51.418745856 0.0 2016-07-02 01:10:51.418745856
2016-07-02 01:10:52.687532544 1.0 2016-07-02 01:10:52.687532544
2016-07-02 01:10:53.288187136 2.0 2016-07-02 01:10:53.288187136
2016-07-02 01:10:54.838806528 3.0 2016-07-02 01:10:54.838806528
2016-07-02 01:10:55.148979456 4.0 2016-07-02 01:10:55.148979456
2016-07-02 01:10:56.415879424 5.0 2016-07-02 01:10:56.415879424
2016-07-02 01:10:57.259467264 6.0 2016-07-02 01:10:57.259467264
2016-07-02 01:10:58.375025408 7.0 2016-07-02 01:10:58.375025408
2016-07-02 01:10:59.019387904 8.0 2016-07-02 01:10:59.019387904
2016-07-02 01:11:00.235784448 9.0 2016-07-02 01:11:00.235784448
This is just adds an offset to data.index...:
>>> data.shift(1, pd.Timedelta("0.4s"))
x orig_time
2016-07-02 01:10:51.818745856 0 2016-07-02 01:10:51.418745856
2016-07-02 01:10:53.087532544 1 2016-07-02 01:10:52.687532544
2016-07-02 01:10:53.688187136 2 2016-07-02 01:10:53.288187136
2016-07-02 01:10:55.238806528 3 2016-07-02 01:10:54.838806528
2016-07-02 01:10:55.548979456 4 2016-07-02 01:10:55.148979456
2016-07-02 01:10:56.815879424 5 2016-07-02 01:10:56.415879424
2016-07-02 01:10:57.659467264 6 2016-07-02 01:10:57.259467264
2016-07-02 01:10:58.775025408 7 2016-07-02 01:10:58.375025408
2016-07-02 01:10:59.419387904 8 2016-07-02 01:10:59.019387904
2016-07-02 01:11:00.635784448 9 2016-07-02 01:11:00.235784448
And this results in Na's for all time points:
>>> data.shift(1, pd.Timedelta("0.4s")).reindex(data.index)
x orig_time
2016-07-02 01:10:51.418745856 NaN NaT
2016-07-02 01:10:52.687532544 NaN NaT
2016-07-02 01:10:53.288187136 NaN NaT
2016-07-02 01:10:54.838806528 NaN NaT
2016-07-02 01:10:55.148979456 NaN NaT
2016-07-02 01:10:56.415879424 NaN NaT
2016-07-02 01:10:57.259467264 NaN NaT
2016-07-02 01:10:58.375025408 NaN NaT
2016-07-02 01:10:59.019387904 NaN NaT
2016-07-02 01:11:00.235784448 NaN NaT
Upvotes: 3
Views: 6684
Reputation: 25659
Using chrisaycock answer.. Your data is below 0.4s
intervals. So, your results are correct. 1s
shows it works.
pd.Series(x, data.index).asof(data.index + pd.Timedelta('1s'))
# 2016-07-02 01:10:52.418745856 0
# 2016-07-02 01:10:53.687532544 2
# 2016-07-02 01:10:54.288187136 2
# 2016-07-02 01:10:55.838806528 4
# 2016-07-02 01:10:56.148979456 4
# 2016-07-02 01:10:57.415879424 6
# 2016-07-02 01:10:58.259467264 6
# 2016-07-02 01:10:59.375025408 8
# 2016-07-02 01:11:00.019387904 8
# 2016-07-02 01:11:01.235784448 9
Upvotes: 0
Reputation: 37928
Just like on this question, you are asking for an asof-join. Fortunately, the next release of pandas (soon-ish) will have it! Until then, you can use a pandas Series to determine the value you want.
Original DataFrame:
In [44]: data
Out[44]:
x
2016-07-02 13:27:05.249071616 0
2016-07-02 13:27:07.280549376 1
2016-07-02 13:27:08.666985984 2
2016-07-02 13:27:08.410521856 3
2016-07-02 13:27:09.896294912 4
2016-07-02 13:27:10.159203328 5
2016-07-02 13:27:10.492438784 6
2016-07-02 13:27:13.790925312 7
2016-07-02 13:27:13.896483072 8
2016-07-02 13:27:13.598456064 9
Convert to Series:
In [45]: ser = pd.Series(data.x, data.index)
In [46]: ser
Out[46]:
2016-07-02 13:27:05.249071616 0
2016-07-02 13:27:07.280549376 1
2016-07-02 13:27:08.666985984 2
2016-07-02 13:27:08.410521856 3
2016-07-02 13:27:09.896294912 4
2016-07-02 13:27:10.159203328 5
2016-07-02 13:27:10.492438784 6
2016-07-02 13:27:13.790925312 7
2016-07-02 13:27:13.896483072 8
2016-07-02 13:27:13.598456064 9
Name: x, dtype: int64
Use the asof
function:
In [47]: ser.asof(ser.index + pd.Timedelta('4s'))
Out[47]:
2016-07-02 13:27:09.249071616 3
2016-07-02 13:27:11.280549376 6
2016-07-02 13:27:12.666985984 6
2016-07-02 13:27:12.410521856 6
2016-07-02 13:27:13.896294912 7
2016-07-02 13:27:14.159203328 9
2016-07-02 13:27:14.492438784 9
2016-07-02 13:27:17.790925312 9
2016-07-02 13:27:17.896483072 9
2016-07-02 13:27:17.598456064 9
Name: x, dtype: int64
(I used four seconds above to make the example easier to read.)
Upvotes: 3