.ffill() overwriting existing values

Question

Here is a time series we can interpret as state transitions:

states = list('abc')
transition_times = [
  pd.to_datetime('00:01:00'),
  pd.to_datetime('00:03:10'),
  pd.to_datetime('00:05:00'),
]
df = pd.DataFrame({'state': states}, index=transition_times);df


Out:
                            state
     2018-01-11 00:01:00    a
     2018-01-11 00:03:10    b
     2018-01-11 00:05:00    c

Suppose I now want a minute indexed timeline of the system's state from 00:01:00 to 00:05:00. My idea was :

df.resample('1min').ffill()

Out:
                            state
     2018-01-11 00:01:00    a
     2018-01-11 00:02:00    a
     2018-01-11 00:03:00    a  <- I would expect 'b' here !
     2018-01-11 00:04:00    b
     2018-01-11 00:05:00    c

As annotated, could someone explain why the [3min,4min[ bin was filled forward with 'a' ?

I can reveal the gaps that I expected to be filled by doing:

df.resample('1min').max()

Out:
                    state
     2018-01-11 00:01:00    a
     2018-01-11 00:02:00    NaN
     2018-01-11 00:03:00    b
     2018-01-11 00:04:00    NaN
     2018-01-11 00:05:00    c

And obtaining the desired result with:

df.resample('1min').max().fillna(method='ffill')

Out:
                    state
     2018-01-11 00:01:00    a
     2018-01-11 00:02:00    a
     2018-01-11 00:03:00    b
     2018-01-11 00:04:00    b
     2018-01-11 00:05:00    c

Thanks a lot!

Quentin

Deena · Accepted Answer

Forward fill is not really overwriting the existing value in this case.

Pandas assume that any NaN value should be equivalent to the previous value. In principle, 00:03:00 was indeed NaN (we have no entry for it), so it gets correctly filled with it's previous value a.

I believe the confusion is how the final minutely "aggregation" or the downsampling occurs (ie: how are the multiple values occurring during minute 3 combined). In this case, the value of the exact minute (00 seconds) is taken

The following might explain the process:

I have added an entry to your dataframe:

states = list('abcd')
transition_times = [
  pd.to_datetime('00:01:00'),
  pd.to_datetime('00:03:10'),
  pd.to_datetime('00:03:30'),
  pd.to_datetime('00:05:00'),
]

df = pd.DataFrame({'state': states}, index=transition_times);df

And here is a manual illustration:

df.resample("10s").asfreq()
#                     state
# 2018-01-11 00:01:00     a
# 2018-01-11 00:01:10   NaN
# 2018-01-11 00:01:20   NaN
# 2018-01-11 00:01:30   NaN
# 2018-01-11 00:01:40   NaN
# 2018-01-11 00:01:50   NaN
# 2018-01-11 00:02:00   NaN
# 2018-01-11 00:02:10   NaN
# 2018-01-11 00:02:20   NaN
# 2018-01-11 00:02:30   NaN
# 2018-01-11 00:02:40   NaN
# 2018-01-11 00:02:50   NaN
# 2018-01-11 00:03:00   NaN
# 2018-01-11 00:03:10     b
# 2018-01-11 00:03:20   NaN
# 2018-01-11 00:03:30     c
# 2018-01-11 00:03:40   NaN
# 2018-01-11 00:03:50   NaN
# 2018-01-11 00:04:00   NaN
# 2018-01-11 00:04:10   NaN
# 2018-01-11 00:04:20   NaN
# 2018-01-11 00:04:30   NaN
# 2018-01-11 00:04:40   NaN
# 2018-01-11 00:04:50   NaN
# 2018-01-11 00:05:00     d

# Forward fill 
df_ffill = df.resample("10s").asfreq().ffill()
#                     state
# 2018-01-11 00:01:00     a
# 2018-01-11 00:01:10     a
# 2018-01-11 00:01:20     a
# 2018-01-11 00:01:30     a
# 2018-01-11 00:01:40     a
# 2018-01-11 00:01:50     a
# 2018-01-11 00:02:00     a
# 2018-01-11 00:02:10     a
# 2018-01-11 00:02:20     a
# 2018-01-11 00:02:30     a
# 2018-01-11 00:02:40     a
# 2018-01-11 00:02:50     a
# 2018-01-11 00:03:00     a
# 2018-01-11 00:03:10     b
# 2018-01-11 00:03:20     b
# 2018-01-11 00:03:30     c
# 2018-01-11 00:03:40     c
# 2018-01-11 00:03:50     c
# 2018-01-11 00:04:00     c
# 2018-01-11 00:04:10     c
# 2018-01-11 00:04:20     c
# 2018-01-11 00:04:30     c
# 2018-01-11 00:04:40     c
# 2018-01-11 00:04:50     c
# 2018-01-11 00:05:00     d

# Manual Downsample
df_ffill[df_ffill.index.second == 0]
#                     state
# 2018-01-11 00:01:00     a
# 2018-01-11 00:02:00     a
# 2018-01-11 00:03:00     a
# 2018-01-11 00:04:00     c
# 2018-01-11 00:05:00     d

# ----------------------------------------

df.resample("1min").ffill()
#                     state
# 2018-01-11 00:01:00     a
# 2018-01-11 00:02:00     a
# 2018-01-11 00:03:00     a
# 2018-01-11 00:04:00     c
# 2018-01-11 00:05:00     d

.ffill() overwriting existing values

Answers (2)

Related Questions