Ingo Weiss
Ingo Weiss

Reputation: 21

Rolling median of date-indexed data with duplicate dates

My date-indexed data can have multiple observations for a given date. I want to get the rolling median of a value but am not getting the result that I am looking for:

df = pd.DataFrame({
    'date': ['2020-06-22', '2020-06-23','2020-06-24','2020-06-24', '2020-06-25', '2020-06-26'],
    'value': [2,8,5,1,3,7]
})
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

enter image description here

# Attempt to get the 3-day rolling median of 'value':
df['value'].rolling('3D').median()

# This yields the following, i.e. one median value
# per **observation**
# (two values for 6/24 in this example):
date
2020-06-22    2.0
2020-06-23    5.0
2020-06-24    5.0
2020-06-24    3.5
2020-06-25    4.0
2020-06-26    4.0
Name: value, dtype: float64

# I was hoping to get one median value
# per **distinct date** in the index
# The median for 6/24, for example, would be computed
# from **all** observations on 6/22, 6/23 and 6/24(2 observations)
date
2020-06-22    NaN
2020-06-23    NaN
2020-06-24    3.5
2020-06-25    4.0
2020-06-26    4.0
Name: value, dtype: float64

How do I need to change my code?

Upvotes: 1

Views: 492

Answers (1)

piterbarg
piterbarg

Reputation: 8219

As far as I can tell, your code produces the right answer for the second occurrence of 2020-06-24, as 3.5 is the median of 4 numbers 2,8,5,1. The first occurrence of 2020-06-24only uses its own value and the ones from the two prior days. Presumably, and I am speculating here, it is looking at the '3D' window in the elements in the rows preceding it in the timeseries, not following.

So I think your code only needs a small modification to satisfy your requirement and that is if there are multiple rows with the same date we should just pick the last one. We will do this below with groupby. Also you want the first two values to be NaN rather than medians of shorter time series -- this can be achieved by passing min_periods = 3 in the rolling function. Here is all the code, I put the median into its own column

df['median'] = df['value'].rolling('3D', min_periods = 3).median()
df.groupby(level = 0, axis = 0).last()

prints


         value  median
date        
2020-06-22  2   NaN
2020-06-23  8   NaN
2020-06-24  1   3.5
2020-06-25  3   4.0
2020-06-26  7   4.0

Upvotes: 1

Related Questions