stellasia

Reputation: 5612

Compute 3 months rolling median from pandas DataFrame with date index

I start from an input DataFrame looking like:

from datetime import datetime
import pandas as pd

df = pd.DataFrame({"created_on": [datetime(2015, 1, 3),
                                  datetime(2015, 1, 5),
                                  datetime(2015, 2, 24),
                                  datetime(2015, 3, 6),
                                  datetime(2015, 3, 17),
                                  datetime(2015, 5, 31),
                                  datetime(2015, 6, 3)],
                   "value": [3, 2, 1, 1, 3, 2, 2]})

i.e.

  created_on  value
0 2015-01-03      3
1 2015-01-05      2
2 2015-02-24      1
3 2015-03-06      1
4 2015-03-17      3
5 2015-05-31      2
6 2015-06-03      2

I would like to get, for each month, the median of the observations whose created_on is in the current month or within the past 2 months from the current month.

The expected output for the input data above is:

month        median_value
2015-01-01   2.5
2015-02-01   2
2015-03-01   2
2015-04-01   1
2015-05-01   2
2015-06-01   2

That is, for month "2015-01-01" only the observations "2015-01-03" and "2015-01-05" are used; for month "2015-02-01" I take "2015-01-03", "2015-01-05" and "2015-02-24"; and so on.

I wanted to extract the month from the date column and then use groupby like this:

df['created_on_month'] = df['created_on'].apply(
    lambda dt: datetime(dt.year, dt.month, 1)   # first day of the observation's month
)
df.groupby('created_on_month')['value'].median()

But I don't know how to aggregate over 3 months in a groupby, such that one row can belong to several groups. Another problem with this solution is that empty months ("2015-04-01" in the example above) do not appear in the result.
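One possible way to let a row belong to several groups is to duplicate each observation once per month it contributes to (its own month and the two following ones) and then run an ordinary groupby. The sketch below is only an illustration under that assumption; the names obs_month, expanded and all_months are made up for the example and are not part of the original post.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    "created_on": [datetime(2015, 1, 3), datetime(2015, 1, 5), datetime(2015, 2, 24),
                   datetime(2015, 3, 6), datetime(2015, 3, 17), datetime(2015, 5, 31),
                   datetime(2015, 6, 3)],
    "value": [3, 2, 1, 1, 3, 2, 2],
})

# Month of each observation as a monthly Period (e.g. 2015-01).
obs_month = df["created_on"].dt.to_period("M")

# Each observation counts toward its own month and the two following months,
# so emit three copies of every row, one per month it contributes to.
expanded = pd.concat([df.assign(month=obs_month + k) for k in range(3)])

# Reindex over every month in the observed range: months without any
# contributing observation show up as NaN instead of disappearing, and
# months past the last observation are dropped.
all_months = pd.period_range(obs_month.min(), obs_month.max(), freq="M")
result = expanded.groupby("month")["value"].median().reindex(all_months)

# Label each month by its first day, as in the expected output.
result.index = result.index.to_timestamp()
print(result)
```

The frame is tripled in size, which is fine for small data but may be wasteful for long windows or large inputs.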

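I have also tried the rolling_median function provided by pandas, which could do the job, but it has to be combined with a resample to monthly values first. That works fine when just counting the number of observations, but not here: the median of the monthly medians is not the same as the median of all observations in the window.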
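To make the difference concrete, here is a small check using the example values for the window ending in April 2015 (February, March and April). It assumes the resample step takes a per-month median and that empty months are skipped; it uses only the standard library.

```python
from statistics import median

# Example values grouped by month for the window ending April 2015.
window = {"2015-02": [1], "2015-03": [1, 3], "2015-04": []}

# Median of the monthly medians; April has no observations,
# so it contributes no monthly median.
monthly_medians = [median(values) for values in window.values() if values]
median_of_medians = median(monthly_medians)

# Median of all observations pooled over the same three months,
# which is the quantity asked for.
pooled_median = median([x for values in window.values() for x in values])

print(median_of_medians, pooled_median)
```

Here the monthly medians are 1 and 2, so their median is 1.5, while the pooled values 1, 1, 3 have median 1, matching the expected output for "2015-04-01".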

Finally, I could also use a simple loop like:

months = pd.date_range('2015-01-01', '2015-06-01', freq='MS')
output = pd.DataFrame(index=months, columns=["month", "median_value"])

for m in months:
    # observations from the start of the month two months back (inclusive)
    # up to the start of the next month (exclusive)
    tmp = df[(df.created_on >= m - pd.DateOffset(months=2))
             & (df.created_on < m + pd.DateOffset(months=1))]
    output.loc[m] = [m, tmp["value"].median()]

print(output)

that yields:

                          month median_value
2015-01-01  2015-01-01 00:00:00          2.5
2015-02-01  2015-02-01 00:00:00            2
2015-03-01  2015-03-01 00:00:00            2
2015-04-01  2015-04-01 00:00:00            1
2015-05-01  2015-05-01 00:00:00            2
2015-06-01  2015-06-01 00:00:00            2

but if a more elegant solution exists, I would be happy to learn it.

Upvotes: 3

Views: 3119

Answers (1)

JohnE

Reputation: 30424

OK, this should be pretty close. I'm using a 90-day window because I'm not sure whether a 3-month window is possible. Otherwise, it seems to work pretty well.

# one row per calendar day, NaN on days without observations
daily = df.set_index('created_on')['value'].resample('D').mean()

# median over the last 90 daily rows; NaN days are skipped
df2 = daily.rolling(window=90, min_periods=1).median()

df2[df2.index.is_month_end]   # keep only the end-of-month rows

created_on
2015-01-31    2.5
2015-02-28    2.0
2015-03-31    2.0
2015-04-30    1.0
2015-05-31    2.0

Note that because my method is different from yours, the results are labeled by end of month rather than beginning of month, but that shouldn't affect the values, and I think end of month is actually more accurate anyway.

I'm not sure of the best way to get a result for June-2015 to print out, but it is stored correctly in df2 for June 3:

df2.tail(1)

created_on
2015-06-03    2

So it's just a matter of how best to extract and display the info. I suppose just padding out the original dataframe with a missing value for June 30 would be one way.
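As a rough sketch of that padding idea: instead of adding a row to the original dataframe, the daily series can be extended with NaN up to the end of the last observed month, so that June 30 exists as a row. This uses Series.rolling from current pandas rather than pd.rolling_apply, and the names last_day, daily, rolled and month_end are only illustrative.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    "created_on": [datetime(2015, 1, 3), datetime(2015, 1, 5), datetime(2015, 2, 24),
                   datetime(2015, 3, 6), datetime(2015, 3, 17), datetime(2015, 5, 31),
                   datetime(2015, 6, 3)],
    "value": [3, 2, 1, 1, 3, 2, 2],
})

s = df.set_index("created_on")["value"]

# Last day of the month containing the final observation (2015-06-30 here).
last_day = s.index.max() + pd.offsets.MonthEnd(0)

# Daily series padded out to last_day; days without observations,
# including the padding, are NaN.
daily = s.resample("D").mean().reindex(
    pd.date_range(s.index.min(), last_day, freq="D")
)

# 90-row (90-day) rolling median; NaN days are skipped.
rolled = daily.rolling(window=90, min_periods=1).median()

# Keep the end-of-month rows only; June now appears as 2015-06-30.
month_end = rolled[rolled.index.is_month_end]
print(month_end)
```

A 90-day window is still only an approximation of three calendar months, so results can differ from the month-based definition when observations fall near a window edge.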

Upvotes: 1
