Reputation: 13997
I have a Pandas dataframe in which each column represents a separate property, and each row holds the properties' value on a specific date:
import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in later pandas releases
dfstr = \
''' AC BO C CCM CL CRD CT DA GC GF
2010-01-19 0.844135 -0.194530 -0.231046 0.245615 -0.581238 -0.593562 0.057288 0.655903 0.823997 0.221920
2010-01-20 -0.204845 -0.225876 0.835611 -0.594950 -0.607364 0.042603 0.639168 0.816524 0.210653 0.237833
2010-01-21 0.824852 -0.216449 -0.220136 0.234343 -0.611756 -0.624060 0.028295 0.622516 0.811741 0.201083'''
df = pd.read_csv(StringIO(dfstr), sep=r'\s+')
Using the rank method, I can find the percentile rank of each property with respect to a specific date:
df.rank(axis=1, pct=True)
Output:
AC BO C CCM CL CRD CT DA GC GF
2010-01-19 1.0 0.4 0.3 0.7 0.2 0.1 0.5 0.8 0.9 0.6
2010-01-20 0.4 0.3 1.0 0.2 0.1 0.5 0.8 0.9 0.6 0.7
2010-01-21 1.0 0.4 0.3 0.7 0.2 0.1 0.5 0.8 0.9 0.6
What I'd like to get instead is the quantile (e.g. quartile, quintile, decile) rank of each property. For example, for quintile rank my desired output would be:
AC BO C CCM CL CRD CT DA GC GF
2010-01-19 5 2 2 4 1 1 3 4 5 3
2010-01-20 2 2 5 1 1 3 4 5 3 4
2010-01-21 5 2 2 4 1 1 3 4 5 3
I might be missing something, but there doesn't seem to be a built-in way to do this kind of quantile ranking with Pandas. What's the simplest way to get my desired output?
Upvotes: 3
Views: 3407
Reputation: 11
You can now use pd.qcut:
df.apply(lambda x: pd.qcut(x, 5, labels=False)+1, axis=1)
Complete test case:
import pandas as pd
from io import StringIO
dfstr = \
''' AC BO C CCM CL CRD CT DA GC GF
2010-01-19 0.844135 -0.194530 -0.231046 0.245615 -0.581238 -0.593562 0.057288 0.655903 0.823997 0.221920
2010-01-20 -0.204845 -0.225876 0.835611 -0.594950 -0.607364 0.042603 0.639168 0.816524 0.210653 0.237833
2010-01-21 0.824852 -0.216449 -0.220136 0.234343 -0.611756 -0.624060 0.028295 0.622516 0.811741 0.201083'''
df = pd.read_csv(StringIO(dfstr), sep=r'\s+')
print('input:','\n',df)
input
AC BO C CCM CL CRD
2010-01-19 0.844135 -0.194530 -0.231046 0.245615 -0.581238 -0.593562 \
2010-01-20 -0.204845 -0.225876 0.835611 -0.594950 -0.607364 0.042603
2010-01-21 0.824852 -0.216449 -0.220136 0.234343 -0.611756 -0.624060
CT DA GC GF
2010-01-19 0.057288 0.655903 0.823997 0.221920
2010-01-20 0.639168 0.816524 0.210653 0.237833
2010-01-21 0.028295 0.622516 0.811741 0.201083
df_out = df.apply(lambda x: pd.qcut(x, 5, labels=False)+1, axis=1)
print('\n','output:','\n', df_out)
output
AC BO C CCM CL CRD CT DA GC GF
2010-01-19 5 2 2 4 1 1 3 4 5 3
2010-01-20 2 2 5 1 1 3 4 5 3 4
2010-01-21 5 2 2 4 1 1 3 4 5 3
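The same call extends to other quantile counts. Here is a minimal sketch, assuming a hypothetical helper named row_quantile_rank (not part of pandas) and a made-up one-row frame:

```python
import pandas as pd


def row_quantile_rank(frame, q):
    """Bin each row into q quantiles with pd.qcut and return 1-based bin numbers."""
    return frame.apply(lambda row: pd.qcut(row, q, labels=False) + 1, axis=1)


# Illustrative data only: one row with ten distinct values.
demo = pd.DataFrame(
    [[0.84, -0.19, -0.23, 0.25, -0.58, -0.59, 0.06, 0.66, 0.82, 0.22]],
    columns=list("ABCDEFGHIJ"),
)

quintiles = row_quantile_rank(demo, 5)   # bins numbered 1..5
deciles = row_quantile_rank(demo, 10)    # bins numbered 1..10
print(quintiles)
print(deciles)
```

Note that pd.qcut raises a ValueError when tied values make the bin edges non-unique, so rows with many duplicates may need different handling.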
Upvotes: 0
Reputation: 42946
mul & np.ceil
You were quite close with rank. Multiply the percentile rank by 5 with .mul to scale it to quintiles, then round up with np.ceil:
import numpy as np
np.ceil(df.rank(axis=1, pct=True).mul(5))
Output
AC BO C CCM CL CRD CT DA GC GF
2010-01-19 5.0 2.0 2.0 4.0 1.0 1.0 3.0 4.0 5.0 3.0
2010-01-20 2.0 2.0 5.0 1.0 1.0 3.0 4.0 5.0 3.0 4.0
2010-01-21 5.0 2.0 2.0 4.0 1.0 1.0 3.0 4.0 5.0 3.0
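Why this works: with pct=True, rank returns rank / n for a row of n values, so multiplying by q and rounding up maps rank r to ceil(r * q / n). The sketch below wraps that in a hypothetical helper (rank_quantile and qcut_quantile are illustrative names) and compares it with pd.qcut on made-up data. The two agree for quintiles of ten values, but they can assign different bins when the number of columns is not a multiple of q, because qcut interpolates its bin edges.

```python
import numpy as np
import pandas as pd


def rank_quantile(frame, q):
    """1-based quantile number per row, computed as ceil(rank / n * q)."""
    return np.ceil(frame.rank(axis=1, pct=True).mul(q)).astype(int)


def qcut_quantile(frame, q):
    """1-based quantile number per row, computed with pd.qcut."""
    return frame.apply(lambda row: pd.qcut(row, q, labels=False) + 1, axis=1)


# Illustrative data only: one row holding the values 1.0 .. 10.0.
demo = pd.DataFrame([np.arange(1.0, 11.0)], columns=list("ABCDEFGHIJ"))

quintiles_by_rank = rank_quantile(demo, 5).iloc[0].tolist()
quintiles_by_qcut = qcut_quantile(demo, 5).iloc[0].tolist()
print(quintiles_by_rank == quintiles_by_qcut)  # True

quartiles_by_rank = rank_quantile(demo, 4).iloc[0].tolist()
quartiles_by_qcut = qcut_quantile(demo, 4).iloc[0].tolist()
print(quartiles_by_rank)  # [1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
print(quartiles_by_qcut)  # [1, 1, 1, 2, 2, 3, 3, 4, 4, 4]
```

Which convention is preferable depends on how the boundary values should be assigned, so it is worth checking on a small row before applying either to real data.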
If you want integers, use astype:
np.ceil(df.rank(axis=1, pct=True).mul(5)).astype(int)
Or even better: since pandas version 0.24.0 there is a nullable integer type, Int64, so we can use:
np.ceil(df.rank(axis=1, pct=True).mul(5)).astype('Int64')
Output
AC BO C CCM CL CRD CT DA GC GF
2010-01-19 5 2 2 4 1 1 3 4 5 3
2010-01-20 2 2 5 1 1 3 4 5 3 4
2010-01-21 5 2 2 4 1 1 3 4 5 3
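The nullable type matters when a row has missing values: rank leaves NaN in place, a plain astype(int) cannot represent it, and Int64 keeps it as <NA>. A small sketch on made-up data (the frame and variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data only: one row with a missing value in column B.
demo = pd.DataFrame([[0.3, np.nan, -0.1, 0.9, 0.5]], columns=list("ABCDE"))
scaled = np.ceil(demo.rank(axis=1, pct=True).mul(5))

# Casting to a plain NumPy integer fails because NaN has no integer representation.
try:
    scaled.astype(int)
    int_cast_failed = False
except ValueError:
    int_cast_failed = True

# The nullable Int64 dtype stores the missing entry as <NA> instead.
nullable = scaled.astype('Int64')
print(int_cast_failed)
print(nullable)
```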
scipy.stats.percentileofscore
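import numpy as np
from scipy import stats

# For each row, turn every value's percentile score (0-100) into a quintile (1-5)
d = df.apply(lambda x: [np.ceil(stats.percentileofscore(x, a, 'rank')*0.05) for a in x], axis=1).values
pd.DataFrame(data=np.concatenate(d).reshape(d.shape[0], len(d[0])),
             columns=df.columns,
             dtype='int',
             index=df.index)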
Output
AC BO C CCM CL CRD CT DA GC GF
2010-01-19 5 2 2 4 1 1 3 4 5 3
2010-01-20 2 2 5 1 1 3 4 5 3 4
2010-01-21 5 2 2 4 1 1 3 4 5 3
Upvotes: 6