Reputation: 1960
I have a dataframe:
d = [f1 f2 f3
1 2 3
5 1 2
3 3 1
2 4 7
.. .. ..]
For a subset of features, I want to add one column per feature holding the percentile of each row's value within that feature.
So for subset = [f1,f2,f3]
my dataframe will be
new_d =[f1 f2 f3 f1_per f2_per f3_per
1 2 3 0 0.25 0.5
5 1 2 1 0 0.25
3 3 1 0.5 0.5 0
2 4 7 0.25 0.75 1
4 5 4 0.75 1 0.75]
What is the best way to do so?
Upvotes: 0
Views: 117
Reputation: 863671
Your expected output has 5 rows but your input has only 4, so the output below differs from yours. Use DataFrame.rank
on all columns and join the result back to the original:
df = df.join(df.rank(pct=True).add_prefix('pct'))
print (df)
f1 f2 f3 pctf1 pctf2 pctf3
0 1 2 3 0.2 0.4 0.6
1 5 1 2 1.0 0.2 0.4
2 3 3 1 0.6 0.6 0.2
3 2 4 7 0.4 0.8 1.0
4 4 5 4 0.8 1.0 0.8
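For reference, `rank(pct=True)` divides each rank by the number of rows, which is why the smallest value gets 0.2 rather than 0 here. A minimal sketch that checks this on the same five rows (the frame is rebuilt so the snippet runs on its own; `manual` is just an illustrative name):

```python
import pandas as pd

# Same five rows as in the output above
df = pd.DataFrame({
    "f1": [1, 5, 3, 2, 4],
    "f2": [2, 1, 3, 4, 5],
    "f3": [3, 2, 1, 7, 4],
})

pct = df.rank(pct=True)       # percentile rank per column
manual = df.rank() / len(df)  # the same computation written out

# Raises an AssertionError if the two frames differ
pd.testing.assert_frame_equal(pct, manual)
print(pct)
```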
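If you need the percentile scaled so that the smallest value gets 0 and the largest gets 1, subtract 1 from the rank and divide by the number of rows minus 1: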
df = df.join(df.rank().sub(1).div(len(df) - 1).add_prefix('pct'))
print (df)
f1 f2 f3 pctf1 pctf2 pctf3
0 1 2 3 0.00 0.25 0.50
1 5 1 2 1.00 0.00 0.25
2 3 3 1 0.50 0.50 0.00
3 2 4 7 0.25 0.75 1.00
4 4 5 4 0.75 1.00 0.75
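The question asks for a subset of features and a `_per` suffix. The same idea should work after selecting the columns first and using `add_suffix` instead of `add_prefix`. A sketch, assuming a hypothetical subset of `f1` and `f3`:

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1, 5, 3, 2, 4],
    "f2": [2, 1, 3, 4, 5],
    "f3": [3, 2, 1, 7, 4],
})

subset = ["f1", "f3"]  # hypothetical choice; use your own column list

# Rank only the selected columns, scale to [0, 1], then join back
per = df[subset].rank().sub(1).div(len(df) - 1).add_suffix("_per")
out = df.join(per)
print(out)
```

Columns outside `subset` (here `f2`) are kept unchanged and get no percentile column.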
Upvotes: 2
Reputation: 61930
Here is another approach, doing explicitly what you want:
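import numpy as np
# for each value, count the values in the same column that are strictly smaller, then scale to [0, 1]
res = df.apply(lambda x: np.greater.outer(x.values, x.values).sum(axis=1) / (len(x) - 1))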
res.columns = [f'{c}_per' for c in df.columns]
res = df.join(res)
print(res)
Output
f1 f2 f3 f1_per f2_per f3_per
0 1 2 3 0.00 0.25 0.50
1 5 1 2 1.00 0.00 0.25
2 3 3 1 0.50 0.50 0.00
3 2 4 7 0.25 0.75 1.00
4 4 5 4 0.75 1.00 0.75
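One difference from `rank` shows up only with ties: the outer comparison counts strictly smaller values, while `rank()` averages tied ranks by default. A small sketch on made-up values (the series `s` is an assumption, not data from the question) to compare the two:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 3])  # made-up values containing a tie

# Share of the other values that are strictly smaller than each value
strict = np.greater.outer(s.values, s.values).sum(axis=1) / (len(s) - 1)

# Default rank averages tied ranks; method="min" matches the strict count
avg = s.rank().sub(1).div(len(s) - 1)
low = s.rank(method="min").sub(1).div(len(s) - 1)

print(strict)
print(avg.to_numpy())
print(low.to_numpy())
```

With no duplicate values in a column, as in the data above, all three give the same result.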
Upvotes: 0
Reputation: 11532
A way to do this is the following:
df['pct_f1'] = df.f1.rank(pct=True)
df['pct_f2'] = df.f2.rank(pct=True)
df['pct_f3'] = df.f3.rank(pct=True)
which gives:
f1 f2 f3 pct_f1 pct_f2 pct_f3
0 1 2 3 0.25 0.50 0.75
1 5 1 2 1.00 0.25 0.50
2 3 3 1 0.75 0.75 0.25
3 2 4 7 0.50 1.00 1.00
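With more than a few features, the same assignments can be written as a loop over the column names. A sketch on the four-row input from the question, where `subset` is a hypothetical list you would set yourself:

```python
import pandas as pd

# Four-row input from the question
df = pd.DataFrame({
    "f1": [1, 5, 3, 2],
    "f2": [2, 1, 3, 4],
    "f3": [3, 2, 1, 7],
})

subset = ["f1", "f2", "f3"]  # hypothetical; list only the features you need

# One percentile-rank column per selected feature
for col in subset:
    df[f"pct_{col}"] = df[col].rank(pct=True)

print(df)
```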
Upvotes: 2