Reputation: 193
Here is a time series we can interpret as state transitions:
states = list('abc')
transition_times = [
pd.to_datetime('00:01:00'),
pd.to_datetime('00:03:10'),
pd.to_datetime('00:05:00'),
]
df = pd.DataFrame({'state': states}, index=transition_times);df
Out:
state
2018-01-11 00:01:00 a
2018-01-11 00:03:10 b
2018-01-11 00:05:00 c
Suppose I now want a minute indexed timeline of the system's state from 00:01:00 to 00:05:00. My idea was :
df.resample('1min').ffill()
Out:
state
2018-01-11 00:01:00 a
2018-01-11 00:02:00 a
2018-01-11 00:03:00 a <- I would expect 'b' here !
2018-01-11 00:04:00 b
2018-01-11 00:05:00 c
As annotated, could someone explain why the [3min,4min[ bin was filled forward with 'a' ?
I can reveal the gaps that I expected to be filled by doing:
df.resample('1min').max()
Out:
state
2018-01-11 00:01:00 a
2018-01-11 00:02:00 NaN
2018-01-11 00:03:00 b
2018-01-11 00:04:00 NaN
2018-01-11 00:05:00 c
And obtaining the desired result with:
df.resample('1min').max().fillna(method='ffill')
Out:
state
2018-01-11 00:01:00 a
2018-01-11 00:02:00 a
2018-01-11 00:03:00 b
2018-01-11 00:04:00 b
2018-01-11 00:05:00 c
Thanks a lot!
Quentin
Upvotes: 2
Views: 462
Reputation: 6213
Forward fill is not really overwriting the existing value in this case.
Pandas assume that any NaN
value should be equivalent to the previous value. In principle, 00:03:00
was indeed NaN
(we have no entry for it), so it gets correctly filled with it's previous value a
.
I believe the confusion is how the final minutely "aggregation" or the downsampling occurs (ie: how are the multiple values occurring during minute 3 combined). In this case, the value of the exact minute (00 seconds) is taken
The following might explain the process:
I have added an entry to your dataframe:
states = list('abcd')
transition_times = [
pd.to_datetime('00:01:00'),
pd.to_datetime('00:03:10'),
pd.to_datetime('00:03:30'),
pd.to_datetime('00:05:00'),
]
df = pd.DataFrame({'state': states}, index=transition_times);df
And here is a manual illustration:
df.resample("10s").asfreq()
# state
# 2018-01-11 00:01:00 a
# 2018-01-11 00:01:10 NaN
# 2018-01-11 00:01:20 NaN
# 2018-01-11 00:01:30 NaN
# 2018-01-11 00:01:40 NaN
# 2018-01-11 00:01:50 NaN
# 2018-01-11 00:02:00 NaN
# 2018-01-11 00:02:10 NaN
# 2018-01-11 00:02:20 NaN
# 2018-01-11 00:02:30 NaN
# 2018-01-11 00:02:40 NaN
# 2018-01-11 00:02:50 NaN
# 2018-01-11 00:03:00 NaN
# 2018-01-11 00:03:10 b
# 2018-01-11 00:03:20 NaN
# 2018-01-11 00:03:30 c
# 2018-01-11 00:03:40 NaN
# 2018-01-11 00:03:50 NaN
# 2018-01-11 00:04:00 NaN
# 2018-01-11 00:04:10 NaN
# 2018-01-11 00:04:20 NaN
# 2018-01-11 00:04:30 NaN
# 2018-01-11 00:04:40 NaN
# 2018-01-11 00:04:50 NaN
# 2018-01-11 00:05:00 d
# Forward fill
df_ffill = df.resample("10s").asfreq().ffill()
# state
# 2018-01-11 00:01:00 a
# 2018-01-11 00:01:10 a
# 2018-01-11 00:01:20 a
# 2018-01-11 00:01:30 a
# 2018-01-11 00:01:40 a
# 2018-01-11 00:01:50 a
# 2018-01-11 00:02:00 a
# 2018-01-11 00:02:10 a
# 2018-01-11 00:02:20 a
# 2018-01-11 00:02:30 a
# 2018-01-11 00:02:40 a
# 2018-01-11 00:02:50 a
# 2018-01-11 00:03:00 a
# 2018-01-11 00:03:10 b
# 2018-01-11 00:03:20 b
# 2018-01-11 00:03:30 c
# 2018-01-11 00:03:40 c
# 2018-01-11 00:03:50 c
# 2018-01-11 00:04:00 c
# 2018-01-11 00:04:10 c
# 2018-01-11 00:04:20 c
# 2018-01-11 00:04:30 c
# 2018-01-11 00:04:40 c
# 2018-01-11 00:04:50 c
# 2018-01-11 00:05:00 d
# Manual Downsample
df_ffill[df_ffill.index.second == 0]
# state
# 2018-01-11 00:01:00 a
# 2018-01-11 00:02:00 a
# 2018-01-11 00:03:00 a
# 2018-01-11 00:04:00 c
# 2018-01-11 00:05:00 d
# ----------------------------------------
df.resample("1min").ffill()
# state
# 2018-01-11 00:01:00 a
# 2018-01-11 00:02:00 a
# 2018-01-11 00:03:00 a
# 2018-01-11 00:04:00 c
# 2018-01-11 00:05:00 d
Upvotes: 1
Reputation: 2162
.ffill
(ie forward fill) will use the last observation to fill forwards from the first item until it is replaced by a new observation. Moving forwards through your timestamps, at the 00:03:00
timestamp the "b" observation hasn't happened yet so it uses the last valid observation ("a").
The opposite of this method is .bfill
(backfill) which works in the reverse order to fill backwards, eg:
In: df.resample('1min').bfill()
Out:
state
2018-01-11 00:01:00 a
2018-01-11 00:02:00 b
2018-01-11 00:03:00 b
2018-01-11 00:04:00 c
2018-01-11 00:05:00 c
More details in the docs: pandas.DataFrame.fillna
Upvotes: 3