Richard Wong
Richard Wong

Reputation: 3

Python: Compare data against the 95th percentile of a running window dataset

I have a large DataFrame of thousands of rows but only 2 columns. The 2 columns are of the below format:

Dt Val
2020-01-01 10.5
2020-01-01 11.2
2020-01-01 10.9
2020-01-03 11.3
2020-01-05 12.0

The first column is date and the second column is a value. For each date, there may be zero, one or more values.

What I need to do is the following: Compute the 95th percentile based on the 30 days that just past and see if the current value is above or below that 95th percentile value. There must however be a minimum of 50 values available for the past 30 days.

For example, if a record has date "2020-12-01" and value "10.5", then I need to first see how many values are there available for the date range 2020-11-01 to 2020-11-30. If there are at least 50 values available over that date range, then I will want to compute the 95th percentile of those values and compare 10.5 against that. If 10.5 is greater than the 95th percentile value, then the result for that record is "Above Threshold". If 10.5 is less than the 95th percentile value, then the result for that record is "Below Threshold". If there are less than 50 values over the date range 2020-11-01 to 2020-11-30, then the result for that record is "Insufficient Data".

I would like to avoid running a loop if possible as it may be expensive from a resource and time perspective to loop through thousands of records to process them one by one. I hope someone can advise of a simple(r) python / pandas solution here.

Upvotes: 0

Views: 455

Answers (1)

Z Li
Z Li

Reputation: 4318

Use rolling on DatetimeIndex to get the number of values available and the 95th percentile in the last 30 days. Here is an example with 3 days rolling window:

import datetime
import pandas as pd

df = pd.DataFrame({'val':[1,2,3,4,5,6]},
                  index = [datetime.date(2020,10,1), datetime.date(2020,10,1), datetime.date(2020,10,2),
                           datetime.date(2020,10,3), datetime.date(2020,10,3), datetime.date(2020,10,4)])

df.index = pd.DatetimeIndex(df.index)

df['number_of_values'] = df.rolling('3D').count()
df['rolling_percentile'] = df.rolling('3D')['val'].quantile(0.9, interpolation='nearest')

Then you can simply do your comparison:

# Above Threshold
(df['val']>df['rolling_percentile'])&(df['number_of_values']>=50)

# Below Threshold
(df['val']>df['rolling_percentile'])&(df['number_of_values']>=50)

# Insufficient Data
df['number_of_values']<50

To remove the current date, close argument would not work for more than one row on a day, so maybe use the rolling apply:

def f(x, metric):
    x = x[x.index!=x.index[-1]]
    if metric == 'count':
        return len(x)
    elif metric == 'percentile':
        return x.quantile(0.9, interpolation='nearest')
    else:
        return np.nan

df = pd.DataFrame({'val':[1,2,3,4,5,6]},
                  index = [datetime.date(2020,10,1), datetime.date(2020,10,1), datetime.date(2020,10,2),
                           datetime.date(2020,10,3), datetime.date(2020,10,3), datetime.date(2020,10,4)])
df.index = pd.DatetimeIndex(df.index)
df['count'] = df.rolling('3D')['val'].apply(f, args = ('count',))
df['percentile'] = df.rolling('3D')['val'].apply(f, args = ('percentile',))
          val   count   percentile
2020-10-01  1   0.0  NaN
2020-10-01  2   0.0  NaN
2020-10-02  3   2.0  2.0
2020-10-03  4   3.0  3.0
2020-10-03  5   3.0  3.0
2020-10-04  6   3.0  5.0

Upvotes: 1

Related Questions