FLab
FLab

Reputation: 7476

Pandas, unexpected behavior of .resample('B')

I start from monthly series time-stamped as the end of the month. I want to upsample them to business (Mon-Fri) daily frequency by filling forward values. I want 2 conditions to be true:

  1. Never lose a value in resampling if it was a weekend in the original time series
  2. Always forward fill: if the EOM date in the original series is Saturday, I want that observation to appear as next Monday in the daily series

Although not elegant I concluded that the safest method to do this is:

    daily_series = monthly_series.resample(rule='D').ffill().resample(rule='B',how='first')

Now, the unexpected case:

    dates = ['1953-02-28', '1953-03-31', '1953-04-30', '1953-05-31']
    # '1953-02-28' was a Saturday

    values = [1,2,3,4]
    monthly_ts  = pd.Series(values, index = dates)

    monthly_ts
    Out[74]: 
    1953-02-28    1
    1953-03-31    2
    1953-04-30    3
    1953-05-31    4
    dtype: int64

    daily_ts = monthly_ts.resample(rule='D').ffill().resample(rule='B',how='first')

     Out[77]: 
     1953-02-27    1           # Why do I have this observation?
     1953-03-02    1
     1953-03-03    1
     1953-03-04    1

The observation of Saturday was used for Friday in the resampling. This happens after .resample(rule = 'B')

Can you please explain to me why this is happening and how to prevent this?

Upvotes: 0

Views: 481

Answers (1)

knagaev
knagaev

Reputation: 2957

This way of behavior happens by reason of making periods for downsampling. The quantity of business days in the interval smaller than calendar days. That's why Fridays and Mondays are joined to Saturday and Sunday and represented as a single unit. Values from Saturday and Sunday used for downsampling with 'how' and 'closed' parameters.

dates = ['1953-02-28', '1953-03-31', '1953-04-30', '1953-05-31']
values = [1,2,3,4]
monthly_ts  = pd.Series(values, index = pd.to_datetime(dates))

At first upsample to calendar days

calendar_daily_ts = monthly_ts.resample(rule='D').ffill()

Let's look at the last three records

In[8]: calendar_daily_ts[-3:]
Out[8]: 
1953-05-29    3
1953-05-30    3
1953-05-31    4
Freq: D, dtype: int64

If we downsample to business days with mean and closed='left' the last value will be 3.33333

In [15]: calendar_daily_ts.resample(rule='B', closed='left').mean()[-2:]
Out[15]: 
1953-05-28    3.000000
1953-05-29    3.333333
Freq: B, dtype: float64

Value at Friday (1953-05-29) is calculated as mean of values at Friday, Saturday and Sunday (3 + 3 + 4) / 3

If we downsample to business days with mean and closed='right' the last value will be 3.5

In [16]: calendar_daily_ts.resample(rule='B', closed='right').mean()[-2:]
Out[16]: 
1953-05-28    3.0
1953-05-29    3.5
Freq: B, dtype: float64

Value at Friday (1953-05-29) is calculated as mean of values at Saturday, Sunday and next Monday (3 + 4 + 0) / 2

That is without the Friday value.

So the observation in your question appeared because for 1953-02-28 (Saturday) the period for downsampling consists of 1953-02-27, 1953-02-28, 1953-03-01 and 1953-03-02. Intervals by default are closed left - (1953-02-27, 1953-02-28, 1953-03-01). You got the first - it's 1953-02-27

One more interesting example

In [45]: calendar_daily_ts[:4]
Out[45]: 
1953-02-27    1
1953-02-28    2
1953-03-01    3
1953-03-02    4
dtype: int64

In [47]: calendar_daily_ts.resample(rule='B', closed='left').first()[:4]
Out[47]: 
1953-02-27    1
1953-03-02    4
1953-03-03    1
1953-03-04    1
Freq: B, dtype: int64

In [48]: calendar_daily_ts.resample(rule='B', closed='right').first()[:4]
Out[48]: 
1953-02-26    1
1953-02-27    2
1953-03-02    1
1953-03-03    1
Freq: B, dtype: int64

See the difference! (1953-02-26 with value 1)

Upvotes: 2

Related Questions