Set value to 90th percentile for each column in a DataFrame

Question

I am working with data that looks like the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'AAA': [1,1,1,2,2,2,3,3], 'BBB': [2,1,3,4,6,1,2,3]})

What I would like to do is cap each column at its 90th percentile: any value that exceeds the column's 90th percentile should be set to that percentile rounded up to the next integer.

This is tricky because every column has a different percentile value.

I am able to get the 90th percentile values using:

df.describe(percentiles=[.9])

So for column BBB, 6 is greater than 4.60 (the 90th percentile), so it needs to be changed to 5 (4.60 rounded up).

In my actual problem I am doing this for a large matrix, so I would like to know whether there is a simpler solution than first building an array of the columns' 90th percentiles and then checking each column's elements against it and setting those that exceed it to the rounded-up percentile.

Alex Riley · Accepted Answer

One vectorised method would be to combine np.minimum and df.quantile:

>>> np.minimum(df, df.quantile(0.9))
   AAA  BBB
0    1  2.0
1    1  1.0
2    1  3.0
3    2  4.0
4    2  4.6
5    2  1.0
6    3  2.0
7    3  3.0
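Note that the output above caps BBB at 4.6, whereas the question asks for the cap to be rounded up to 5. A minimal sketch of one way to do that, assuming np.ceil is the intended "roundup" and using DataFrame.clip in place of np.minimum (the names cap and capped are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AAA': [1, 1, 1, 2, 2, 2, 3, 3],
                   'BBB': [2, 1, 3, 4, 6, 1, 2, 3]})

# Per-column 90th percentile, rounded up to the next integer
# (assumption: "roundup" in the question means ceiling).
cap = np.ceil(df.quantile(0.9))

# axis=1 aligns the Series of caps with the column labels,
# so each column is clipped at its own cap.
capped = df.clip(upper=cap, axis=1)

print(capped)
```

Here the 6 in BBB becomes 5 and every other value is left alone. If the result comes back as floats and integers are wanted, capped.astype(int) should be safe for this data, since the caps are whole numbers.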

For a bigger speed boost use:

np.minimum(df, np.percentile(df, 90, axis=0))

df.quantile appears to be slower than np.percentile (possibly because it returns a Series rather than a plain NumPy array).
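To check the speed difference on your own data, something like the following rough sketch could be used. The random 10,000 x 50 frame is only a stand-in for the "large matrix" in the question, and the timings will vary with machine and library versions:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for a large matrix.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.normal(size=(10_000, 50)))

# Both calls use linear interpolation by default,
# so they should produce the same per-column values.
via_quantile = big.quantile(0.9).to_numpy()
via_percentile = np.percentile(big.to_numpy(), 90, axis=0)
same = np.allclose(via_quantile, via_percentile)

# Time 20 repetitions of each approach.
t_quantile = timeit.timeit(lambda: big.quantile(0.9), number=20)
t_percentile = timeit.timeit(
    lambda: np.percentile(big.to_numpy(), 90, axis=0), number=20)

print(f"results agree: {same}")
print(f"df.quantile:   {t_quantile:.4f} s")
print(f"np.percentile: {t_percentile:.4f} s")
```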
