Yantraguru
Yantraguru

Reputation: 3844

Set value to 90th percentile for each column in a DataFrame

I am working with the data which looks like a DataFrame described by

df = pd.DataFrame({'AAA': [1,1,1,2,2,2,3,3], 'BBB': [2,1,3,4,6,1,2,3]})

What I would like to do is to set value to roundup (90%) if value exceeds the 90th percentile. So it's like capping the maximum to the 90th percentile.

This is getting trickier for me as every column is going to have different percentile value.

I am able to get 90th percentile value using:

df.describe(percentiles=[.9])

So for column BBB, 6 is greater than 4.60 (90th percentile), hence it needs to be changed to 5 (roundup 4.60).

In my actual problem I am doing this for a large matrix, so I would like to know if there is any simple solution to this, rather than creating an array of 90th percentile of columns first and then checking elements in for a column and setting those to roundup to the 90th percentile.

Upvotes: 2

Views: 4270

Answers (2)

Alex Riley
Alex Riley

Reputation: 176880

One vectorised method would be to combine np.minimum and df.quantile:

>>> np.minimum(df, df.quantile(0.9))
   AAA  BBB
0    1  2.0
1    1  1.0
2    1  3.0
3    2  4.0
4    2  4.6
5    2  1.0
6    3  2.0
7    3  3.0

For a bigger speed boost use:

np.minimum(df, np.percentile(df, 90, axis=0))

df.quantile appears to be slower than np.percentile (possibly because it returns a Series rather than a plain NumPy array).

Upvotes: 3

Zero
Zero

Reputation: 76927

One way to do this apply clip_upper() on 90 percentile value np.percentile(x, 90) for each column

In [242]: df.apply(lambda x: x.clip_upper(np.percentile(x, 90)))
Out[242]:
   AAA  BBB
0    1  2.0
1    1  1.0
2    1  3.0
3    2  4.0
4    2  4.6
5    2  1.0
6    3  2.0
7    3  3.0

I had imagined @ajcr elegant solution would faster than apply. But,

Below benchmarks for len(df) ~ 130K

In [245]: %timeit df.apply(lambda x: x.clip_upper(np.percentile(x, 90)))
100 loops, best of 3: 7.49 ms per loop

In [246]: %timeit np.minimum(df, df.quantile(0.9))
100 loops, best of 3: 11.1 ms per loop

And for len(df) ~ 1M

In [248]: %timeit df.apply(lambda x: x.clip_upper(np.percentile(x, 90)))
10 loops, best of 3: 54.5 ms per loop

In [249]: %timeit np.minimum(df, df.quantile(0.9))
10 loops, best of 3: 73.9 ms per loop

Upvotes: 1

Related Questions