Reputation: 313
This is a variation on a similar question I asked: filling last known data with pandas
In a nutshell, I wanted to know how to forward fill time series data while respecting the ID of each data point, so that each row is filled from the last known data for its own ID.
For example, this
2014-07-24 17:49:00 5 1046.0 -3.0 -239.0 2800.0
...
2015-05-05 15:00:00 2 NaN NaN NaN 2680
2015-05-05 15:00:00 3 0989 0020 -0011 2680
2015-05-05 15:00:00 4 1022 0060 -0076 2600
2015-05-05 15:00:00 5 NaN NaN NaN 2623
becomes
2015-05-05 15:00:00 2 NaN NaN NaN 2680
2015-05-05 15:00:00 3 0989 0020 -0011 2680
2015-05-05 15:00:00 4 1022 0060 -0076 2600
2015-05-05 15:00:00 5 1046 -3.0 -239.0 2623
noting that the last known data for ID=5 was from 2014-07-24 17:49:00
The variation now is to do the same thing, except that it should take a "validity period" for the data into account. What I tried was assigning a DatetimeIndex,
then slicing the dataframe to the valid time period with df[start:end],
and then applying the fix from my previous question.
This, however, produced a subset that is separate from my "big" dataframe. What I need is to do the operation on the "big" dataframe itself, and to be able to move this window through all of the data.
Upvotes: 0
Views: 57
Reputation: 880777
Pandas has a TimeGrouper object (spelled pd.Grouper(freq=...) in newer pandas versions) which can help you group DataFrames that have a DatetimeIndex by intervals of time. Groupby operations can then be nested to further group by 'id':
import numpy as np
import pandas as pd

df = pd.DataFrame([['2014-07-24 17:49:00', 5, 1046.0, -3, -239, 2800],
                   ['2015-05-05 15:00:00', 2, np.nan, np.nan, np.nan, 2680],
                   ['2015-05-05 15:00:00', 3, 989, 20, -11, 2680],
                   ['2015-05-05 15:00:00', 4, 1022, 60, -76, 2600],
                   ['2015-05-05 15:00:00', 5, np.nan, np.nan, np.nan, 2623]],
                  columns='timestamp id A B C D'.split())
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index(['timestamp'])

# Bin the rows into 300-day windows, then forward fill per id inside each window.
# pd.TimeGrouper('300D') was removed from later pandas releases;
# pd.Grouper(freq='300D') is its replacement.
print(df.groupby(pd.Grouper(freq='300D'), group_keys=False)
        .apply(lambda grp: grp.set_index(['id'], append=True)
                              .groupby(level='id').ffill()))
yields
                           A    B     C     D
timestamp           id
2014-07-24 17:49:00 5   1046   -3  -239  2800
2015-05-05 15:00:00 2    NaN  NaN   NaN  2680
                    3    989   20   -11  2680
                    4   1022   60   -76  2600
                    5   1046   -3  -239  2623
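One thing to keep in mind with fixed time bins is that two rows close together in time can still fall on opposite sides of a bin boundary. Below is a minimal sketch of a per-row alternative, assuming that "validity period" means a filled value may be at most a given timedelta older than the row it fills. The helper name ffill_within and its parameters are made up for illustration, and it expects the timestamp as an ordinary column rather than as the index.

```python
import numpy as np
import pandas as pd


def ffill_within(df, value_cols, validity, id_col="id", time_col="timestamp"):
    """Forward fill value_cols per id, keeping a filled value only when the
    observation it came from is at most `validity` older than the row."""
    # Sort by time and use a fresh unique index so assignments align cleanly.
    out = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    # Timestamp of each row's own observation, NaT where the row has no data.
    observed = out[time_col].where(out[value_cols].notna().all(axis=1))
    # Carry the last observation time and the last values forward within each id.
    last_seen = observed.groupby(out[id_col]).ffill()
    filled = out.groupby(id_col)[value_cols].ffill()
    # Blank out fills whose source observation is older than the validity period
    # (rows with no earlier observation compare as not fresh and stay NaN).
    fresh = (out[time_col] - last_seen) <= validity
    filled.loc[~fresh.to_numpy()] = np.nan
    out[value_cols] = filled
    return out


df = pd.DataFrame(
    [["2014-07-24 17:49:00", 5, 1046.0, -3, -239, 2800],
     ["2015-05-05 15:00:00", 2, np.nan, np.nan, np.nan, 2680],
     ["2015-05-05 15:00:00", 3, 989, 20, -11, 2680],
     ["2015-05-05 15:00:00", 4, 1022, 60, -76, 2600],
     ["2015-05-05 15:00:00", 5, np.nan, np.nan, np.nan, 2623]],
    columns="timestamp id A B C D".split(),
)
df["timestamp"] = pd.to_datetime(df["timestamp"])

# The id=5 observation is roughly 285 days old at 2015-05-05 15:00:00.
wide = ffill_within(df, ["A", "B", "C"], pd.Timedelta("300D"))
narrow = ffill_within(df, ["A", "B", "C"], pd.Timedelta("30D"))
print(wide)
print(narrow)
```

With the 300-day validity the last id=5 row picks up the 2014 values, while with 30 days it stays NaN. Because the check is made per row against the whole dataframe, there is no separate subset to merge back.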
Upvotes: 0
Reputation: 394409
You can groupby the 'id' column and then call ffill:
In [95]:
df.groupby(['id'], as_index=False).ffill()
Out[95]:
             datetime  id     a    b     c     d
0 2014-07-24 17:49:00   5  1046   -3  -239  2800
1 2015-05-05 15:00:00   2   NaN  NaN   NaN  2680
2 2015-05-05 15:00:00   3   989   20   -11  2680
3 2015-05-05 15:00:00   4  1022   60   -76  2600
4 2015-05-05 15:00:00   5  1046   -3  -239  2623
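Whether the result of groupby(...).ffill() still contains the 'id' column appears to depend on the pandas version. A sketch that sidesteps this, assuming the same column names as the output above, fills only the value columns and assigns them back to the original frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["2014-07-24 17:49:00", 5, 1046.0, -3, -239, 2800],
     ["2015-05-05 15:00:00", 2, np.nan, np.nan, np.nan, 2680],
     ["2015-05-05 15:00:00", 3, 989, 20, -11, 2680],
     ["2015-05-05 15:00:00", 4, 1022, 60, -76, 2600],
     ["2015-05-05 15:00:00", 5, np.nan, np.nan, np.nan, 2623]],
    columns=["datetime", "id", "a", "b", "c", "d"],
)
df["datetime"] = pd.to_datetime(df["datetime"])

# Columns that may contain gaps; "d" is complete and is left untouched.
value_cols = ["a", "b", "c"]

# Sort by time so "last known" means the most recent earlier row per id.
result = df.sort_values("datetime", kind="stable").copy()
# Fill within each id, then write the filled columns back in place.
result[value_cols] = result.groupby("id")[value_cols].ffill()
print(result)
```

Note that ffill also accepts a limit argument, but it counts consecutive rows rather than elapsed time, so on its own it does not express a time-based validity period.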
Upvotes: 1