kwentine
kwentine

Reputation: 193

.ffill() overwriting existing values

Here is a time series we can interpret as state transitions:

states = list('abc')
transition_times = [
  pd.to_datetime('00:01:00'),
  pd.to_datetime('00:03:10'),
  pd.to_datetime('00:05:00'),
]
df = pd.DataFrame({'state': states}, index=transition_times);df


Out:
                            state
     2018-01-11 00:01:00    a
     2018-01-11 00:03:10    b
     2018-01-11 00:05:00    c

Suppose I now want a minute indexed timeline of the system's state from 00:01:00 to 00:05:00. My idea was :

df.resample('1min').ffill()

Out:
                            state
     2018-01-11 00:01:00    a
     2018-01-11 00:02:00    a
     2018-01-11 00:03:00    a  <- I would expect 'b' here !
     2018-01-11 00:04:00    b
     2018-01-11 00:05:00    c

As annotated, could someone explain why the [3min,4min[ bin was filled forward with 'a' ?

I can reveal the gaps that I expected to be filled by doing:

df.resample('1min').max()

Out:
                    state
     2018-01-11 00:01:00    a
     2018-01-11 00:02:00    NaN
     2018-01-11 00:03:00    b
     2018-01-11 00:04:00    NaN
     2018-01-11 00:05:00    c

And obtaining the desired result with:

df.resample('1min').max().fillna(method='ffill')

Out:
                    state
     2018-01-11 00:01:00    a
     2018-01-11 00:02:00    a
     2018-01-11 00:03:00    b
     2018-01-11 00:04:00    b
     2018-01-11 00:05:00    c

Thanks a lot!

Quentin

Upvotes: 2

Views: 462

Answers (2)

Deena
Deena

Reputation: 6213

Forward fill is not really overwriting the existing value in this case.

Pandas assume that any NaN value should be equivalent to the previous value. In principle, 00:03:00 was indeed NaN (we have no entry for it), so it gets correctly filled with it's previous value a.

I believe the confusion is how the final minutely "aggregation" or the downsampling occurs (ie: how are the multiple values occurring during minute 3 combined). In this case, the value of the exact minute (00 seconds) is taken

The following might explain the process:

I have added an entry to your dataframe:

states = list('abcd')
transition_times = [
  pd.to_datetime('00:01:00'),
  pd.to_datetime('00:03:10'),
  pd.to_datetime('00:03:30'),
  pd.to_datetime('00:05:00'),
]

df = pd.DataFrame({'state': states}, index=transition_times);df

And here is a manual illustration:

df.resample("10s").asfreq()
#                     state
# 2018-01-11 00:01:00     a
# 2018-01-11 00:01:10   NaN
# 2018-01-11 00:01:20   NaN
# 2018-01-11 00:01:30   NaN
# 2018-01-11 00:01:40   NaN
# 2018-01-11 00:01:50   NaN
# 2018-01-11 00:02:00   NaN
# 2018-01-11 00:02:10   NaN
# 2018-01-11 00:02:20   NaN
# 2018-01-11 00:02:30   NaN
# 2018-01-11 00:02:40   NaN
# 2018-01-11 00:02:50   NaN
# 2018-01-11 00:03:00   NaN
# 2018-01-11 00:03:10     b
# 2018-01-11 00:03:20   NaN
# 2018-01-11 00:03:30     c
# 2018-01-11 00:03:40   NaN
# 2018-01-11 00:03:50   NaN
# 2018-01-11 00:04:00   NaN
# 2018-01-11 00:04:10   NaN
# 2018-01-11 00:04:20   NaN
# 2018-01-11 00:04:30   NaN
# 2018-01-11 00:04:40   NaN
# 2018-01-11 00:04:50   NaN
# 2018-01-11 00:05:00     d

# Forward fill 
df_ffill = df.resample("10s").asfreq().ffill()
#                     state
# 2018-01-11 00:01:00     a
# 2018-01-11 00:01:10     a
# 2018-01-11 00:01:20     a
# 2018-01-11 00:01:30     a
# 2018-01-11 00:01:40     a
# 2018-01-11 00:01:50     a
# 2018-01-11 00:02:00     a
# 2018-01-11 00:02:10     a
# 2018-01-11 00:02:20     a
# 2018-01-11 00:02:30     a
# 2018-01-11 00:02:40     a
# 2018-01-11 00:02:50     a
# 2018-01-11 00:03:00     a
# 2018-01-11 00:03:10     b
# 2018-01-11 00:03:20     b
# 2018-01-11 00:03:30     c
# 2018-01-11 00:03:40     c
# 2018-01-11 00:03:50     c
# 2018-01-11 00:04:00     c
# 2018-01-11 00:04:10     c
# 2018-01-11 00:04:20     c
# 2018-01-11 00:04:30     c
# 2018-01-11 00:04:40     c
# 2018-01-11 00:04:50     c
# 2018-01-11 00:05:00     d

# Manual Downsample
df_ffill[df_ffill.index.second == 0]
#                     state
# 2018-01-11 00:01:00     a
# 2018-01-11 00:02:00     a
# 2018-01-11 00:03:00     a
# 2018-01-11 00:04:00     c
# 2018-01-11 00:05:00     d

# ----------------------------------------

df.resample("1min").ffill()
#                     state
# 2018-01-11 00:01:00     a
# 2018-01-11 00:02:00     a
# 2018-01-11 00:03:00     a
# 2018-01-11 00:04:00     c
# 2018-01-11 00:05:00     d

Upvotes: 1

Phil Sheard
Phil Sheard

Reputation: 2162

.ffill (ie forward fill) will use the last observation to fill forwards from the first item until it is replaced by a new observation. Moving forwards through your timestamps, at the 00:03:00 timestamp the "b" observation hasn't happened yet so it uses the last valid observation ("a").

The opposite of this method is .bfill (backfill) which works in the reverse order to fill backwards, eg:

In: df.resample('1min').bfill()
Out:
                     state
    2018-01-11 00:01:00     a
    2018-01-11 00:02:00     b
    2018-01-11 00:03:00     b
    2018-01-11 00:04:00     c
    2018-01-11 00:05:00     c

More details in the docs: pandas.DataFrame.fillna

Upvotes: 3

Related Questions