Reputation: 343
I wrote a function and passed it to rolling() with apply() to calculate a rolling 3-day percentile, but after the first 3 days the result shows 0s for the rest of the column.
I assume the first 2 days, which have NaN values, are not being used in the percentile calculation, and that this may be defaulting the rest of the column to zero and incorrectly giving the value 33 for the third day. But I'm not sure about this.
I have been trying to solve this without success. Does anybody know why this happens and how to correct the code below? Any help would be greatly appreciated.
import pandas as pd
import numpy as np
from scipy import stats

data = {'a': [1, 15, 27, 399, 17, 568, 200, 9],
        'b': [2, 30, 15, 60, 15, 80, 53, 41],
        'c': [100, 200, 3, 78, 25, 88, 300, 91],
        'd': [4, 300, 400, 500, 23, 43, 9, 71]
        }
dfgrass = pd.DataFrame(data)

def percnum(x):
    for t in dfgrass.index:
        aaa = (x <= dfgrass.loc[t, 'b']).value_counts()
        ccc = (x <= dfgrass.loc[t, 'b']).values.sum()
        vvv = len(x)
        nnn = ccc / vvv
        return nnn * 100

dfgrass['e'] = dfgrass['b'].rolling(window=3).apply(percnum)
print(dfgrass)
Upvotes: 1
Views: 248
Reputation: 74
You can use the pandas rolling function combined with quantile as follows. Enter any quantile value between 0 and 1 (i.e. your percentile/100). If you don't want NaNs at the beginning, set min_periods to 1.
import pandas as pd

data = {'a': [1, 15, 27, 399, 17, 568, 200, 9],
        'b': [2, 30, 15, 60, 15, 80, 53, 41],
        'c': [100, 200, 3, 78, 25, 88, 300, 91],
        'd': [4, 300, 400, 500, 23, 43, 9, 71]
        }
dfgrass = pd.DataFrame(data)

# axis=0 is the default and the keyword is deprecated in recent pandas, so it is omitted here
rolling_percentile = dfgrass.rolling(window=3, min_periods=1, center=False).quantile(0.4)
print(rolling_percentile)
which gives the following output:
       a     b      c      d
0    1.0   2.0  100.0    4.0
1    6.6  13.2  140.0  122.4
2   12.2  12.4   80.6  240.8
3   24.6  27.0   63.0  380.0
4   25.0  15.0   20.6  324.6
5  322.6  51.0   67.4   39.0
6  163.4  45.4   75.4   20.2
7  161.8  50.6   90.4   36.2
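To make the effect of min_periods concrete, here is a minimal sketch on column b alone, assuming the same data as the question. The variable names strict and relaxed are made up for this illustration.

```python
import pandas as pd

b = pd.Series([2, 30, 15, 60, 15, 80, 53, 41], name="b")

# By default min_periods equals the window size, so the first two rows are NaN.
strict = b.rolling(window=3).quantile(0.4)

# With min_periods=1 the quantile is computed from however many values are available.
relaxed = b.rolling(window=3, min_periods=1).quantile(0.4)

print(pd.DataFrame({"b": b, "strict": strict, "relaxed": relaxed}))
```

Note that quantile returns the value sitting at the 40th percentile of each window. That is a different quantity from the percentile rank of the latest value, which is what the function in the question appears to be computing.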
Upvotes: 0
Reputation: 444
Another option for what you are attempting is to apply pandas' rank method with pct=True directly in your function. This computes the percentile rank on the subset defined by the rolling window. This can be done like so:
def rolling_percentile(x):
    d = pd.DataFrame(x)
    d['rolling'] = d.rank(pct=True)
    return d.iloc[-1, 1]
Then you can insert that into your apply:
df['rolling_apply'] = df[column].rolling(window).apply(rolling_percentile)
Additional notes on the function: There are other ways to do this, but within the function I create a rolling column on the subset x of the initial dataframe. For each call, x is a window holding the n most recent values. For example, if your window is three, the values passed will look a little like this: [1, 15, 27] (a Series by default in recent pandas, or a NumPy array with raw=True).
Hence, the rolling percentage that interests us is that of the last value of x relative to the values contained within the window. Therefore we take the value at position [-1, 1], which corresponds to the rolling column of the last value.
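As a hedged illustration of the explanation above, the sketch below applies the same idea to column b of the question's data with a window of 3. It calls rank(pct=True) on the window Series directly rather than building a temporary DataFrame, and the name last_value_pct_rank is made up for this example.

```python
import pandas as pd

b = pd.Series([2, 30, 15, 60, 15, 80, 53, 41], name="b")

def last_value_pct_rank(window):
    # With the default raw=False, pandas passes each window as a Series.
    # rank(pct=True) gives each value's rank divided by the window length;
    # the last entry is the rank of the most recent observation.
    return window.rank(pct=True).iloc[-1]

result = b.rolling(window=3).apply(last_value_pct_rank)
print(result)
```

Ties receive the average of their ranks by default, so the window [15, 60, 15] yields 0.5 for the last value. Passing method="max" to rank would instead count tied values as "less than or equal", giving 2/3 for that window.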
Upvotes: 1
Reputation: 115
If you're trying to compute the percentile ranks, then you can try something like
def percnum(x):
    n = len(x)
    temp = x.argsort()
    ranks = np.empty(n)
    ranks[temp] = (np.arange(n) + 1) / n
    return ranks[-1]

dfgrass.rolling(3).apply(percnum)
which gives the following output
          a         b         c         d
0       NaN       NaN       NaN       NaN
1       NaN       NaN       NaN       NaN
2  1.000000  0.666667  0.333333  1.000000
3  1.000000  1.000000  0.666667  1.000000
4  0.333333  0.666667  0.666667  0.333333
5  1.000000  1.000000  1.000000  0.666667
6  0.666667  0.666667  1.000000  0.333333
7  0.333333  0.333333  0.666667  1.000000
Upvotes: 1
Reputation: 115
Perhaps try changing for t in dfgrass.index to for t in x.index in your implementation of def percnum(x) like so:
def percnum(x):
    for t in x.index:
        aaa = (x <= dfgrass.loc[t, 'b']).value_counts()
        ccc = (x <= dfgrass.loc[t, 'b']).values.sum()
        vvv = len(x)
        nnn = ccc / vvv
        return nnn * 100
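For context on why the version in the question printed zeros: assuming its return statement sits inside the loop, the loop exits on its first pass with t = 0, so every window is compared against dfgrass.loc[0, 'b'], which is 2. The first full window [2, 30, 15] has one value out of three at or below 2 (hence 33), and no later window contains a value that small (hence 0). A minimal sketch that compares each window with its own last value, and so avoids the lookup into the global dataframe, might look like this. The name pct_at_or_below_last is hypothetical.

```python
import pandas as pd

b = pd.Series([2, 30, 15, 60, 15, 80, 53, 41], name="b")

def pct_at_or_below_last(window):
    # Share of values in the window that are <= the most recent value,
    # expressed as a percentage.
    return (window <= window.iloc[-1]).mean() * 100

e = b.rolling(window=3).apply(pct_at_or_below_last)
print(e.round(2).tolist())
```

On this data the result for column b agrees with the percentile-rank output shown in the argsort answer above, multiplied by 100.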
Upvotes: 1