calculate a rolling median on a dataframe that has a non-unique date index

Question

I have the following dataframe, which has a non-unique index using dates:

                column
2019-01-07         NaN    
2019-01-08         NaN
2019-01-08        0.02
2019-01-09       31.45
2019-01-10         NaN
2019-01-10       71.87
2019-01-10       90.18
2019-01-11         NaN
2019-01-12       12.67
2019-01-12        5.68
2019-01-12       11.23
2019-01-12       21.67
2019-01-12       14.77
2019-01-12        5.18
2019-01-13       14.38
2019-01-13         NaN
2019-01-13       71.13
2019-01-13       20.02
2019-01-13      103.10
2019-01-14         NaN
2019-01-15       32.48
2019-01-16       37.37
2019-01-16       31.05
2019-01-16        7.00
2019-01-17         NaN
2019-01-17       39.65
2019-01-18       23.68
2019-01-18        0.08
2019-01-18       41.35
2019-01-19         NaN
2019-01-19       45.85
2019-01-19        3.98
2019-01-19        4.60
2019-01-19         NaN
2019-01-19         NaN
2019-01-20        3.60
2019-01-20        5.03
2019-01-20       15.70

My aim is to calculate a 7 day rolling median using all the values for each date, but ignoring the NaN values.

The resulting dataframe should have a unique date index with the rolling median as a column value for that date, similar to the following:

                column
2019-01-13       17.40    
2019-01-14       17.40
2019-01-15       20.85
2019-01-16       20.85
2019-01-17       20.02
2019-01-18       20.85
2019-01-19       31.05
2019-01-20       19.69

I am not sure how to easily achieve this using pandas, so if anyone can provide an answer or point me in the right direction, it will be very much appreciated.

EDIT

To make things clearer, I will explain how the median needs to be calculated for a single date.

Using the date of 2019-01-13, the rolling median will need to use all the values, excluding the NaN values, from the 7 days up to the 13th. This means that the values that need to be included in the median calculation for the 13th are 0.02, 31.45, 71.87, 90.18, 12.67, 5.68, 11.23, 21.67, 14.77, 5.18, 14.38, 71.13, 20.02, 103.10. The resulting median for the 13th would be 17.4.

Hope that helps.

Kris · Accepted Answer

A rolling object is iterable, which allows for a solution like this:

# drop NAs and group by date into lists of values
df_per_date = df.dropna().groupby('date').apply(lambda g: g.value.to_list())


# compute medians across windows ('sum' concatenates multiple lists into one list)
medians = [np.median(window.agg(sum)) for window in df_per_date.rolling(5)]


# result
medians = pd.Series(index=df_per_date.index, data=medians)


medians

By the way, I loaded the data like this:

# load the data
df = pd.read_csv(pd.io.common.StringIO("""
2019-01-07         NaN    
2019-01-08         NaN
2019-01-08        0.02
2019-01-09       31.45
2019-01-10         NaN
2019-01-10       71.87
2019-01-10       90.18
2019-01-11         NaN
2019-01-12       12.67
2019-01-12        5.68
2019-01-12       11.23
2019-01-12       21.67
2019-01-12       14.77
2019-01-12        5.18
2019-01-13       14.38
2019-01-13         NaN
2019-01-13       71.13
2019-01-13       20.02
2019-01-13      103.10
2019-01-14         NaN
2019-01-15       32.48
2019-01-16       37.37
2019-01-16       31.05
2019-01-16        7.00
2019-01-17         NaN
2019-01-17       39.65
2019-01-18       23.68
2019-01-18        0.08
2019-01-18       41.35
2019-01-19         NaN
2019-01-19       45.85
2019-01-19        3.98
2019-01-19        4.60
2019-01-19         NaN
2019-01-19         NaN
2019-01-20        3.60
2019-01-20        5.03
2019-01-20       15.70
""".strip()), sep='\s+', names=['date', 'value'], parse_dates=['date'])

calculate a rolling median on a dataframe that has a non-unique date index

Answers (2)

Related Questions