Calculate

Reputation: 343

Rolling Percentile Function outputting 0's in column?

When I create a function and use rolling() with apply() to calculate a rolling 3-day percentile, the output shows 0s for the rest of the column after the first 3 days.

I'm assuming that the first two days, which have NaN values, are not being used in the percentile calculation, and that this may be defaulting the rest of the column to zero and incorrectly giving the value 33 for the third day. But I'm not sure about this.

I have been trying to solve this but have not found a solution. Does anybody know why this happens and how to correct the code below? Any help would be greatly appreciated.

import pandas as pd
import numpy as np
from scipy import stats
data = { 'a': [1, 15, 27, 399, 17, 568, 200, 9], 
         'b': [2, 30, 15, 60, 15, 80, 53, 41],
         'c': [100,200, 3, 78, 25, 88, 300, 91],
         'd': [4, 300, 400, 500, 23, 43, 9, 71]
         }

dfgrass = pd.DataFrame(data)
def percnum(x):
    for t in dfgrass.index:
        aaa = (x<=dfgrass.loc[t,'b']).value_counts()
        ccc = (x<=dfgrass.loc[t, 'b']).values.sum()
        vvv = len(x)
        nnn = ccc/ vvv
        return nnn * 100

dfgrass['e'] = dfgrass['b'].rolling(window=3).apply(percnum)
print(dfgrass)

Upvotes: 1

Views: 248

Answers (4)

Adrian

Reputation: 74

You can use the pandas rolling function combined with quantile as follows. Enter any quantile value between 0 and 1 (i.e. your percentile / 100). If you don't want NaNs at the beginning, set min_periods to 1.

import pandas as pd

data = { 'a': [1, 15, 27, 399, 17, 568, 200, 9],
         'b': [2, 30, 15, 60, 15, 80, 53, 41],
         'c': [100,200, 3, 78, 25, 88, 300, 91],
         'd': [4, 300, 400, 500, 23, 43, 9, 71]
         }
dfgrass = pd.DataFrame(data)
# axis=0 is the default, so it is omitted here
rolling_percentile = dfgrass.rolling(window=3, min_periods=1, center=False).quantile(0.4)
print(rolling_percentile)

which gives the following output:

       a     b      c      d
0    1.0   2.0  100.0    4.0
1    6.6  13.2  140.0  122.4
2   12.2  12.4   80.6  240.8
3   24.6  27.0   63.0  380.0
4   25.0  15.0   20.6  324.6
5  322.6  51.0   67.4   39.0
6  163.4  45.4   75.4   20.2
7  161.8  50.6   90.4   36.2
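
To illustrate the effect of min_periods, here is a minimal sketch that compares the default (which requires a full window) with min_periods=1, using column b only. The variable names full_window and partial_window are just illustrative. Keep in mind that quantile returns the value found at a given quantile of each window, not the percentile rank of the latest value.

```python
import pandas as pd

b = pd.Series([2, 30, 15, 60, 15, 80, 53, 41], name="b")

# Default: min_periods equals the window size, so the first two rows are NaN.
full_window = b.rolling(window=3).quantile(0.4)

# min_periods=1: a result is produced as soon as one value is available.
partial_window = b.rolling(window=3, min_periods=1).quantile(0.4)

print(pd.DataFrame({"full_window": full_window, "partial_window": partial_window}))
```

From the third row onward both columns should agree, since every window is full by then.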

Upvotes: 0

Fredaroo

Reputation: 444

Another option for what you are attempting is to apply pandas' rank method with pct=True inside your function. This computes the percentile rank directly on the subset defined by the rolling window. This can be done like so:

def rolling_percentile(x):
    d = pd.DataFrame(x)
    d['rolling'] = d.rank(pct=True)
    return d.iloc[-1, 1]

Then you can insert that into your apply:

df['rolling_apply'] = df[column].rolling(window).apply(rolling_percentile)

Additional notes on the function: there are other ways to do this, but within the function I create a 'rolling' column on x, the subset of the initial dataframe covered by the current window. On each call, x holds the current value together with the n - 1 values before it. For example, if your window is three, the values passed in will look something like [1, 15, 27]. The percentile rank that interests us is that of the last value of x relative to all the values in the window. Therefore we take the value at position [-1, 1], which is the 'rolling' column entry for the last value.
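
As a concrete illustration, here is a minimal sketch that applies the same rank-based idea to column b of the question's data with a window of 3. It ranks a Series inside the function instead of building a DataFrame, which is a simplification rather than the exact code above, and last_value_pct_rank is a hypothetical name. Note that rank averages tied values by default, so ties produce fractional ranks.

```python
import pandas as pd

b = pd.Series([2, 30, 15, 60, 15, 80, 53, 41], name="b")

def last_value_pct_rank(x):
    # Rank the window's values as fractions in (0, 1]; ties share the average rank.
    ranks = pd.Series(x).rank(pct=True)
    # Keep only the rank of the most recent value in the window.
    return ranks.iloc[-1]

rolling_rank = b.rolling(3).apply(last_value_pct_rank)
print(rolling_rank)
```

For the window [15, 60, 15], the two 15s tie for ranks 1 and 2, so the last value gets 1.5 / 3 = 0.5.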

Upvotes: 1

Czar Yobero

Reputation: 115

If you're trying to compute the percentile ranks, then you can try something like:

def percnum(x):
    n = len(x)
    temp = x.argsort()
    ranks = np.empty(n)
    ranks[temp] = (np.arange(n) + 1) / n
    return ranks[-1]

dfgrass.rolling(3).apply(percnum)

which gives the following output:

          a         b         c         d
0       NaN       NaN       NaN       NaN
1       NaN       NaN       NaN       NaN
2  1.000000  0.666667  0.333333  1.000000
3  1.000000  1.000000  0.666667  1.000000
4  0.333333  0.666667  0.666667  0.333333
5  1.000000  1.000000  1.000000  0.666667
6  0.666667  0.666667  1.000000  0.333333
7  0.333333  0.333333  0.666667  1.000000

Upvotes: 1

Czar Yobero

Reputation: 115

Perhaps try changing for t in dfgrass.index to for t in x.index in your implementation of def percnum(x) like so:

def percnum(x):
    for t in x.index:
        aaa = (x<=dfgrass.loc[t,'b']).value_counts()
        ccc = (x<=dfgrass.loc[t, 'b']).values.sum()
        vvv = len(x)
        nnn = ccc/ vvv
        return nnn * 100
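
For context on where the zeros come from: in the original function the return sits inside the for loop, so the loop exits on its first iteration and every window is compared against dfgrass.loc[0, 'b'], which is 2. Only the first full window contains a value that small, hence 33.33 followed by zeros. The sketch below is one possible rewrite, not necessarily what the asker intended: it assumes the goal is the percentage of values in each window that are less than or equal to the window's most recent value. The name percent_at_or_below_last is hypothetical, and raw=True is used so that x arrives as a NumPy array.

```python
import pandas as pd

dfgrass = pd.DataFrame({"b": [2, 30, 15, 60, 15, 80, 53, 41]})

def percent_at_or_below_last(x):
    # x is the current window as a NumPy array (because raw=True below).
    # Compare every value in the window with the most recent one, x[-1],
    # and convert the fraction of True results to a percentage.
    return (x <= x[-1]).mean() * 100

dfgrass["e"] = dfgrass["b"].rolling(window=3).apply(percent_at_or_below_last, raw=True)
print(dfgrass)
```

The first two rows of e stay NaN because a window of 3 is not yet full; pass min_periods=1 to rolling if values are wanted there too.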

Upvotes: 1
