Isaac Sullivan
Isaac Sullivan

Reputation: 77

calculate a rolling median on a dataframe that has a non-unique date index

I have the following dataframe, which has a non-unique index using dates:

                column
2019-01-07         NaN    
2019-01-08         NaN
2019-01-08        0.02
2019-01-09       31.45
2019-01-10         NaN
2019-01-10       71.87
2019-01-10       90.18
2019-01-11         NaN
2019-01-12       12.67
2019-01-12        5.68
2019-01-12       11.23
2019-01-12       21.67
2019-01-12       14.77
2019-01-12        5.18
2019-01-13       14.38
2019-01-13         NaN
2019-01-13       71.13
2019-01-13       20.02
2019-01-13      103.10
2019-01-14         NaN
2019-01-15       32.48
2019-01-16       37.37
2019-01-16       31.05
2019-01-16        7.00
2019-01-17         NaN
2019-01-17       39.65
2019-01-18       23.68
2019-01-18        0.08
2019-01-18       41.35
2019-01-19         NaN
2019-01-19       45.85
2019-01-19        3.98
2019-01-19        4.60
2019-01-19         NaN
2019-01-19         NaN
2019-01-20        3.60
2019-01-20        5.03
2019-01-20       15.70

My aim is to calculate a 7 day rolling median using all the values for each date, but ignoring the NaN values.

The resulting dataframe should have a unique date index with the rolling median as a column value for that date, similar to the following:

                column
2019-01-13       17.40    
2019-01-14       17.40
2019-01-15       20.85
2019-01-16       20.85
2019-01-17       20.02
2019-01-18       20.85
2019-01-19       31.05
2019-01-20       19.69

I am not sure how to easily achieve this using pandas, so if anyone can provide an answer or point me in the right direction, it will be very much appreciated.

EDIT

To make things clearer, I will explain how the median needs to be calculated for a single date.

Using the date of 2019-01-13, the rolling median will need to use all the values, excluding the NaN values, from the 7 days up to the 13th. This means that the values that need to be included in the median calculation for the 13th are 0.02, 31.45, 71.87, 90.18, 12.67, 5.68, 11.23, 21.67, 14.77, 5.18, 14.38, 71.13, 20.02, 103.10. The resulting median for the 13th would be 17.4.

Hope that helps.

Upvotes: 2

Views: 515

Answers (2)

Kris
Kris

Reputation: 23599

A rolling object is iterable, which allows for a solution like this:

# drop NAs and group by date into lists of values
df_per_date = df.dropna().groupby('date').apply(lambda g: g.value.to_list())


# compute medians across windows ('sum' concatenates multiple lists into one list)
medians = [np.median(window.agg(sum)) for window in df_per_date.rolling(5)]


# result
medians = pd.Series(index=df_per_date.index, data=medians)


medians

enter image description here


By the way, I loaded the data like this:

# load the data
df = pd.read_csv(pd.io.common.StringIO("""
2019-01-07         NaN    
2019-01-08         NaN
2019-01-08        0.02
2019-01-09       31.45
2019-01-10         NaN
2019-01-10       71.87
2019-01-10       90.18
2019-01-11         NaN
2019-01-12       12.67
2019-01-12        5.68
2019-01-12       11.23
2019-01-12       21.67
2019-01-12       14.77
2019-01-12        5.18
2019-01-13       14.38
2019-01-13         NaN
2019-01-13       71.13
2019-01-13       20.02
2019-01-13      103.10
2019-01-14         NaN
2019-01-15       32.48
2019-01-16       37.37
2019-01-16       31.05
2019-01-16        7.00
2019-01-17         NaN
2019-01-17       39.65
2019-01-18       23.68
2019-01-18        0.08
2019-01-18       41.35
2019-01-19         NaN
2019-01-19       45.85
2019-01-19        3.98
2019-01-19        4.60
2019-01-19         NaN
2019-01-19         NaN
2019-01-20        3.60
2019-01-20        5.03
2019-01-20       15.70
""".strip()), sep='\s+', names=['date', 'value'], parse_dates=['date'])

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150785

If the data is not too long, cross merge can work:

df['key'] = 1
df = df.reset_index()

(df.merge(df, on='key', suffixes=['','_'])
   .loc[lambda x: x['index'].ge(x['index_']) & 
                  x['index'].sub(x['index_']).le('7D')]
   .groupby('index')['column_'].median()
)

Output:

index
2019-01-07       NaN
2019-01-08     0.020
2019-01-09    15.735
2019-01-10    51.660
2019-01-11    51.660
2019-01-12    13.720
2019-01-13    17.395
2019-01-14    17.395
2019-01-15    20.020
2019-01-16    21.670
2019-01-17    21.670
2019-01-18    20.845
2019-01-19    20.020
2019-01-20    21.850
Name: column_, dtype: float64

Upvotes: 0

Related Questions