fredbaba
fredbaba

Reputation: 1496

Properly shifting irregular time series in Pandas

What's the proper way to shift this time series, and re-align the data to the same index? E.g. How would I generate the data frame with the same index values as "data," but where the value at each point was the last value seen as of 0.4 seconds after the index timestamp?

I'd expect this to be a rather common operation among people dealing with irregular and mixed frequency time series ("what's the last value as of an arbitrary time offset to my current time?"), so I would expect (hope for?) this functionality to exist...

Suppose I have the following data frame:

>>> import pandas as pd
>>> import numpy as np
>>> import time
>>> 
>>> x = np.arange(10)
>>> #t = time.time() + x + np.random.randn(10)
... t = np.array([1467421851418745856, 1467421852687532544, 1467421853288187136,
...        1467421854838806528, 1467421855148979456, 1467421856415879424,
...        1467421857259467264, 1467421858375025408, 1467421859019387904,
...        1467421860235784448])
>>> data = pd.DataFrame({"x": x})
>>> data.index = pd.to_datetime(t)
>>> data["orig_time"] = data.index
>>> data
                               x                     orig_time
2016-07-02 01:10:51.418745856  0 2016-07-02 01:10:51.418745856
2016-07-02 01:10:52.687532544  1 2016-07-02 01:10:52.687532544
2016-07-02 01:10:53.288187136  2 2016-07-02 01:10:53.288187136
2016-07-02 01:10:54.838806528  3 2016-07-02 01:10:54.838806528
2016-07-02 01:10:55.148979456  4 2016-07-02 01:10:55.148979456
2016-07-02 01:10:56.415879424  5 2016-07-02 01:10:56.415879424
2016-07-02 01:10:57.259467264  6 2016-07-02 01:10:57.259467264
2016-07-02 01:10:58.375025408  7 2016-07-02 01:10:58.375025408
2016-07-02 01:10:59.019387904  8 2016-07-02 01:10:59.019387904
2016-07-02 01:11:00.235784448  9 2016-07-02 01:11:00.235784448

I can write the following function:

def time_shift(df, delta):
    """Shift a DataFrame object such that each row contains the last known
    value as of the time `df.index + delta`."""
    lookup_index = df.index + delta
    mapped_indicies = np.searchsorted(df.index, lookup_index, side='left')
    # Clamp bounds to allow us to index into the original DataFrame
    cleaned_indicies = np.clip(mapped_indicies, 0, 
                               len(mapped_indicies) - 1)
    # Since searchsorted gives us an insertion point, we'll generally
    # have to shift back by one to get the last value prior to the
    # insertion point. I choose to keep contemporaneous values,
    # rather than looking back one, but that's a matter of personal
    # preference.
    lookback = np.where(lookup_index < df.index[cleaned_indicies], 1, 0)
    # And remember to re-clip to avoid index errors...
    cleaned_indicies = np.clip(cleaned_indicies - lookback, 0, 
                               len(mapped_indicies) - 1)

    new_df = df.iloc[cleaned_indicies]
    # We don't know what the value was before the beginning...
    new_df.iloc[lookup_index < df.index[0]] = np.NaN
    # We don't know what the value was after the end...
    new_df.iloc[mapped_indicies >= len(mapped_indicies)] = np.NaN
    new_df.index = df.index

    return new_df

with the desired behavior:

>>> time_shift(data, pd.Timedelta('0.4s'))
                                 x                     orig_time
2016-07-02 01:10:51.418745856  0.0 2016-07-02 01:10:51.418745856
2016-07-02 01:10:52.687532544  1.0 2016-07-02 01:10:52.687532544
2016-07-02 01:10:53.288187136  2.0 2016-07-02 01:10:53.288187136
2016-07-02 01:10:54.838806528  4.0 2016-07-02 01:10:55.148979456
2016-07-02 01:10:55.148979456  4.0 2016-07-02 01:10:55.148979456
2016-07-02 01:10:56.415879424  5.0 2016-07-02 01:10:56.415879424
2016-07-02 01:10:57.259467264  6.0 2016-07-02 01:10:57.259467264
2016-07-02 01:10:58.375025408  7.0 2016-07-02 01:10:58.375025408
2016-07-02 01:10:59.019387904  8.0 2016-07-02 01:10:59.019387904
2016-07-02 01:11:00.235784448  NaN                           NaT

As you can see, getting this calculation right is a bit tricky, so I'd much prefer a supported implementation vs. 'rolling my own'.

This doesn't work. It shifts truncates the first argument and shifts all rows by 0 positions:

>>> data.shift(0.4)
                                 x                     orig_time
2016-07-02 01:10:51.418745856  0.0 2016-07-02 01:10:51.418745856
2016-07-02 01:10:52.687532544  1.0 2016-07-02 01:10:52.687532544
2016-07-02 01:10:53.288187136  2.0 2016-07-02 01:10:53.288187136
2016-07-02 01:10:54.838806528  3.0 2016-07-02 01:10:54.838806528
2016-07-02 01:10:55.148979456  4.0 2016-07-02 01:10:55.148979456
2016-07-02 01:10:56.415879424  5.0 2016-07-02 01:10:56.415879424
2016-07-02 01:10:57.259467264  6.0 2016-07-02 01:10:57.259467264
2016-07-02 01:10:58.375025408  7.0 2016-07-02 01:10:58.375025408
2016-07-02 01:10:59.019387904  8.0 2016-07-02 01:10:59.019387904
2016-07-02 01:11:00.235784448  9.0 2016-07-02 01:11:00.235784448

This is just adds an offset to data.index...:

>>> data.shift(1, pd.Timedelta("0.4s"))
                               x                     orig_time
2016-07-02 01:10:51.818745856  0 2016-07-02 01:10:51.418745856
2016-07-02 01:10:53.087532544  1 2016-07-02 01:10:52.687532544
2016-07-02 01:10:53.688187136  2 2016-07-02 01:10:53.288187136
2016-07-02 01:10:55.238806528  3 2016-07-02 01:10:54.838806528
2016-07-02 01:10:55.548979456  4 2016-07-02 01:10:55.148979456
2016-07-02 01:10:56.815879424  5 2016-07-02 01:10:56.415879424
2016-07-02 01:10:57.659467264  6 2016-07-02 01:10:57.259467264
2016-07-02 01:10:58.775025408  7 2016-07-02 01:10:58.375025408
2016-07-02 01:10:59.419387904  8 2016-07-02 01:10:59.019387904
2016-07-02 01:11:00.635784448  9 2016-07-02 01:11:00.235784448

And this results in Na's for all time points:

>>> data.shift(1, pd.Timedelta("0.4s")).reindex(data.index)
                                x orig_time
2016-07-02 01:10:51.418745856 NaN       NaT
2016-07-02 01:10:52.687532544 NaN       NaT
2016-07-02 01:10:53.288187136 NaN       NaT
2016-07-02 01:10:54.838806528 NaN       NaT
2016-07-02 01:10:55.148979456 NaN       NaT
2016-07-02 01:10:56.415879424 NaN       NaT
2016-07-02 01:10:57.259467264 NaN       NaT
2016-07-02 01:10:58.375025408 NaN       NaT
2016-07-02 01:10:59.019387904 NaN       NaT
2016-07-02 01:11:00.235784448 NaN       NaT

Upvotes: 3

Views: 6684

Answers (2)

Merlin
Merlin

Reputation: 25659

Using chrisaycock answer.. Your data is below 0.4s intervals. So, your results are correct. 1s shows it works.

pd.Series(x, data.index).asof(data.index + pd.Timedelta('1s'))

#     2016-07-02 01:10:52.418745856    0
#     2016-07-02 01:10:53.687532544    2
#     2016-07-02 01:10:54.288187136    2
#     2016-07-02 01:10:55.838806528    4
#     2016-07-02 01:10:56.148979456    4
#     2016-07-02 01:10:57.415879424    6
#     2016-07-02 01:10:58.259467264    6
#     2016-07-02 01:10:59.375025408    8
#     2016-07-02 01:11:00.019387904    8
#     2016-07-02 01:11:01.235784448    9

Upvotes: 0

chrisaycock
chrisaycock

Reputation: 37928

Just like on this question, you are asking for an asof-join. Fortunately, the next release of pandas (soon-ish) will have it! Until then, you can use a pandas Series to determine the value you want.

Original DataFrame:

In [44]: data
Out[44]: 
                               x
2016-07-02 13:27:05.249071616  0
2016-07-02 13:27:07.280549376  1
2016-07-02 13:27:08.666985984  2
2016-07-02 13:27:08.410521856  3
2016-07-02 13:27:09.896294912  4
2016-07-02 13:27:10.159203328  5
2016-07-02 13:27:10.492438784  6
2016-07-02 13:27:13.790925312  7
2016-07-02 13:27:13.896483072  8
2016-07-02 13:27:13.598456064  9

Convert to Series:

In [45]: ser = pd.Series(data.x, data.index)

In [46]: ser
Out[46]: 
2016-07-02 13:27:05.249071616    0
2016-07-02 13:27:07.280549376    1
2016-07-02 13:27:08.666985984    2
2016-07-02 13:27:08.410521856    3
2016-07-02 13:27:09.896294912    4
2016-07-02 13:27:10.159203328    5
2016-07-02 13:27:10.492438784    6
2016-07-02 13:27:13.790925312    7
2016-07-02 13:27:13.896483072    8
2016-07-02 13:27:13.598456064    9
Name: x, dtype: int64

Use the asof function:

In [47]: ser.asof(ser.index + pd.Timedelta('4s'))
Out[47]: 
2016-07-02 13:27:09.249071616    3
2016-07-02 13:27:11.280549376    6
2016-07-02 13:27:12.666985984    6
2016-07-02 13:27:12.410521856    6
2016-07-02 13:27:13.896294912    7
2016-07-02 13:27:14.159203328    9
2016-07-02 13:27:14.492438784    9
2016-07-02 13:27:17.790925312    9
2016-07-02 13:27:17.896483072    9
2016-07-02 13:27:17.598456064    9
Name: x, dtype: int64

(I used four seconds above to make the example easier to read.)

Upvotes: 3

Related Questions