Reputation: 313

filling last known timeseries data using pandas

this is a variation on a similar question I asked earlier: filling last known data with pandas

in a nutshell, I wanted to know how to forward fill timeseries data separately for each ID, so that a missing row is filled from the last known row with the same ID.

ergo, this

2014-07-24 17:49:00   5   1046.0   -3.0   -239.0   2800.0
...
2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     NaN     NaN     NaN    2623

becomes

2015-05-05 15:00:00   2     NaN     NaN     NaN    2680 
2015-05-05 15:00:00   3     0989      0020     -0011    2680
2015-05-05 15:00:00   4    1022      0060     -0076    2600 
2015-05-05 15:00:00   5     1046     -3.0     -239.0    2623

noting that the last known data for ID=5 was from 2014-07-24 17:49:00

the variation now is to do the same thing, but only within a "validity period" for the data. what I tried was setting a DatetimeIndex, slicing the dataframe to the valid time period with df[start:end], and then applying the fix from my previous question.

This, however, gives me a separate subset rather than an update to my "big" dataframe. what I need is to run the operation on the big dataframe itself, and to be able to slide this window across all of the data.

Upvotes: 0

Answers (2)

unutbu

Reputation: 880777

Pandas has a TimeGrouper object (spelled pd.Grouper(freq=...) in current versions, where TimeGrouper no longer exists) which can help you group DataFrames that have a DatetimeIndex by intervals of time. Groupby operations can then be nested to further group by 'id':

import numpy as np
import pandas as pd

df = pd.DataFrame([['2014-07-24 17:49:00', 5, 1046.0, -3, -239, 2800],
                   ['2015-05-05 15:00:00', 2, np.nan, np.nan,np.nan, 2680],
                   ['2015-05-05 15:00:00', 3,  989, 20, -11, 2680], 
                   ['2015-05-05 15:00:00', 4, 1022, 60, -76, 2600], 
                   ['2015-05-05 15:00:00', 5, np.nan, np.nan, np.nan, 2623]], 
                  columns='timestamp id A B C D'.split())
df['timestamp'] = pd.DatetimeIndex(df['timestamp'])
df = df.set_index(['timestamp'])

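# pd.TimeGrouper has been removed from pandas; pd.Grouper(freq=...) is its replacement.
# Outer groupby: fixed 300-day bins. Inner groupby: forward fill per id within each bin.
print(df.groupby(pd.Grouper(freq='300D'), group_keys=False)
      .apply(lambda grp: grp.set_index(['id'], append=True)
             .groupby(level='id').ffill()))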

yields

                           A   B    C     D
timestamp           id                     
2014-07-24 17:49:00 5   1046  -3 -239  2800
2015-05-05 15:00:00 2    NaN NaN  NaN  2680
                    3    989  20  -11  2680
                    4   1022  60  -76  2600
                    5   1046  -3 -239  2623
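
The 300-day groups above are fixed bins, so two rows only a few days apart can still land in different bins. If the validity period is meant to slide with each row, as the question describes, one possible sketch is to forward fill per id and then blank out any value whose source observation is older than the validity period. The helper name ffill_within and its parameters are hypothetical, not part of pandas, and the sketch assumes the rows are sorted by timestamp:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['2014-07-24 17:49:00', 5, 1046.0, -3, -239, 2800],
     ['2015-05-05 15:00:00', 2, np.nan, np.nan, np.nan, 2680],
     ['2015-05-05 15:00:00', 3, 989, 20, -11, 2680],
     ['2015-05-05 15:00:00', 4, 1022, 60, -76, 2600],
     ['2015-05-05 15:00:00', 5, np.nan, np.nan, np.nan, 2623]],
    columns='timestamp id A B C D'.split())
df['timestamp'] = pd.to_datetime(df['timestamp'])
# The fill relies on chronological row order; mergesort keeps ties stable.
df = df.sort_values('timestamp', kind='mergesort').reset_index(drop=True)


def ffill_within(df, cols, validity):
    """Hypothetical helper: forward fill `cols` per id, but only from
    observations that are at most `validity` older than the row being filled."""
    out = df.copy()
    for col in cols:
        # Timestamp of the most recent non-null observation of `col` for each id.
        last_seen = (df['timestamp'].where(df[col].notna())
                     .groupby(df['id']).ffill())
        # Plain per-id forward fill, ignoring the validity period for now.
        filled = df.groupby('id')[col].ffill()
        # Discard fills whose source observation is older than the validity period.
        too_old = (df['timestamp'] - last_seen) > validity
        out[col] = filled.mask(too_old)
    return out


print(ffill_within(df, ['A', 'B', 'C'], pd.Timedelta('300D')))
print(ffill_within(df, ['A', 'B', 'C'], pd.Timedelta('30D')))
```

With the sample data, the last observation for id 5 is about 285 days old, so a 300-day validity fills it and a 30-day validity leaves it as NaN. The result is a full-size copy of the original frame rather than a slice of it.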

Upvotes: 0

EdChum

Reputation: 394409

You can group by the 'id' column and then call ffill:

In [95]:
df.groupby(['id'], as_index=False).ffill()

Out[95]:
             datetime  id     a   b    c     d
0 2014-07-24 17:49:00   5  1046  -3 -239  2800
1 2015-05-05 15:00:00   2   NaN NaN  NaN  2680
2 2015-05-05 15:00:00   3   989  20  -11  2680
3 2015-05-05 15:00:00   4  1022  60  -76  2600
4 2015-05-05 15:00:00   5  1046  -3 -239  2623
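
For anyone who wants to reproduce this, here is a self-contained sketch that builds the same frame. Depending on the pandas version, the frame returned by groupby(...).ffill() may not include the 'id' column, so this variant assigns the filled values back into the original frame instead; the value_cols name is just an illustrative choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime(['2014-07-24 17:49:00'] + ['2015-05-05 15:00:00'] * 4),
    'id': [5, 2, 3, 4, 5],
    'a': [1046, np.nan, 989, 1022, np.nan],
    'b': [-3, np.nan, 20, 60, np.nan],
    'c': [-239, np.nan, -11, -76, np.nan],
    'd': [2800, 2680, 2680, 2600, 2623],
})

# Forward fill only the measurement columns, separately for each id,
# and write the result back so 'datetime' and 'id' stay in place.
value_cols = ['a', 'b', 'c']
df[value_cols] = df.groupby('id')[value_cols].ffill()

print(df)
```

Note that this fills from the last known row for each id regardless of how old it is; it does not apply any validity period.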

Upvotes: 1
