Flavio Moraes
Flavio Moraes

Reputation: 1351

monthly resampling pandas with specific start day

I'm creating a pandas DataFrame with random dates and random integers values and I want to resample it by month and compute the average value of integers. This can be done with the following code:

def random_dates(start='2018-01-01', end='2019-01-01', n=300):

    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')

dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})

print(df.resample('M', on='Month').mean())

The thing is that the resampled months always starts from day one and I want all months to start from day 15. I'm using pandas 1.1.4 and I've tried using origin='15/01/2018' or offset='15' and none of them works with 'M' resample rule (they do work when I use 30D but it is of no use). I've also tried to use '2SM'but it also doesn't work.

So my question is if is there a way of changing the resample rule or I will have to add an offset in my data?

Upvotes: 1

Views: 1937

Answers (2)

itaishz
itaishz

Reputation: 711

Edit: Not a working solution for OP's request. See short discussion in the comments.

Interesting problem. I suggest to resample using 'SMS' - semi-month start frequency (1st and 15th). Instead of keeping just the mean values, keep the count and sum values and recalculate the weighted mean for each monthly period by its two sub-period (for example: 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).

The advantages here is that unlike with an (improper use of an) offset, we are certain we always start on the 15th of the month till the 14th of the next month.

df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm

           Integers      
                sum count
Month                    
2018-01-01      876    16
2018-01-15      864    16
2018-02-01      412    10
2018-02-15      626    12
...
2018-12-01      492    10
2018-12-15      638    16

Rolling sum and rolling count; Find the mean out of them:

df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']

df_sm
           Integers       count_sum count_rolling       mean
                sum count                                   
Month                                                       
2018-01-01      876    16       NaN           NaN        NaN
2018-01-15      864    16    1740.0          32.0  54.375000
2018-02-01      412    10    1276.0          26.0  49.076923
2018-02-15      626    12    1038.0          22.0  47.181818
...
2018-12-01      492    10    1556.0          27.0  57.629630
2018-12-15      638    16    1130.0          26.0  43.461538

Now, just filter the odd indices of df_sm:

df_sm.iloc[1::2]['mean']

Month
2018-01-15    54.375000
2018-02-15    47.181818
2018-03-15    51.000000
2018-04-15    44.897436
2018-05-15    52.450000
2018-06-15    33.722222
2018-07-15    41.277778
2018-08-15    46.391304
2018-09-15    45.631579
2018-10-15    54.107143
2018-11-15    58.058824
2018-12-15    43.461538
Freq: 2SMS-15, Name: mean, dtype: float64

The code:

df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm[1::2]['mean']

Edit: Changed a name of one of the columns to make it clearer

Upvotes: 1

Valdi_Bo
Valdi_Bo

Reputation: 30971

Assume that the source DataFrame is:

       Month  Amount
0 2020-05-05       1
1 2020-05-14       1
2 2020-05-15      10
3 2020-05-20      10
4 2020-05-30      10
5 2020-06-15      20
6 2020-06-20      20

To compute your "shifted" resample, first shift Month column so that the 15-th day of month becomes the 1-st:

df.Month = df.Month - pd.Timedelta('14D')

and then resample:

res = df.resample('M', on='Month').mean()

The result is:

            Amount
Month             
2020-04-30       1
2020-05-31      10
2020-06-30      20

If you want, change dates in the index to month periods:

res.index = res.index.to_period('M')

Then the result will be:

         Amount
Month          
2020-04       1
2020-05      10
2020-06      20

Upvotes: 1

Related Questions