Reputation: 3844
I am working with the data which looks like a DataFrame described by
df = pd.DataFrame({'AAA': [1,1,1,2,2,2,3,3], 'BBB': [2,1,3,4,6,1,2,3]})
What I would like to do is to set value to roundup (90%) if value exceeds the 90th percentile. So it's like capping the maximum to the 90th percentile.
This is getting trickier for me as every column is going to have different percentile value.
I am able to get 90th percentile value using:
df.describe(percentiles=[.9])
So for column BBB, 6 is greater than 4.60 (90th percentile), hence it needs to be changed to 5 (roundup 4.60).
In my actual problem I am doing this for a large matrix, so I would like to know if there is any simple solution to this, rather than creating an array of 90th percentile of columns first and then checking elements in for a column and setting those to roundup to the 90th percentile.
Upvotes: 2
Views: 4270
Reputation: 176880
One vectorised method would be to combine np.minimum
and df.quantile
:
>>> np.minimum(df, df.quantile(0.9))
AAA BBB
0 1 2.0
1 1 1.0
2 1 3.0
3 2 4.0
4 2 4.6
5 2 1.0
6 3 2.0
7 3 3.0
For a bigger speed boost use:
np.minimum(df, np.percentile(df, 90, axis=0))
df.quantile
appears to be slower than np.percentile
(possibly because it returns a Series rather than a plain NumPy array).
Upvotes: 3
Reputation: 76927
One way to do this apply clip_upper()
on 90 percentile value np.percentile(x, 90)
for each column
In [242]: df.apply(lambda x: x.clip_upper(np.percentile(x, 90)))
Out[242]:
AAA BBB
0 1 2.0
1 1 1.0
2 1 3.0
3 2 4.0
4 2 4.6
5 2 1.0
6 3 2.0
7 3 3.0
I had imagined @ajcr elegant solution would faster than apply
. But,
Below benchmarks for len(df) ~ 130K
In [245]: %timeit df.apply(lambda x: x.clip_upper(np.percentile(x, 90)))
100 loops, best of 3: 7.49 ms per loop
In [246]: %timeit np.minimum(df, df.quantile(0.9))
100 loops, best of 3: 11.1 ms per loop
And for len(df) ~ 1M
In [248]: %timeit df.apply(lambda x: x.clip_upper(np.percentile(x, 90)))
10 loops, best of 3: 54.5 ms per loop
In [249]: %timeit np.minimum(df, df.quantile(0.9))
10 loops, best of 3: 73.9 ms per loop
Upvotes: 1