Reputation: 155
I have an array A (of size m x n), and a percentage p in [0,1]. I need to produce an m x n boolean array B, with True in the (i,j) entry if A[i,j] is in the p^{th} quantile of the column A[:,j].
Here is the code I have used so far.
import numpy as np
m = 200
n = 300
A = np.random.rand(m, n)
p = 0.3
quant_levels = np.zeros(n)
for i in range(n):
    quant_levels[i] = np.quantile(A[:, i], p)
B = np.array(A >= quant_levels)
Upvotes: 2
Views: 82
Reputation: 11603
I'm not sure it's much faster, but you should at least be aware that numpy.quantile has an axis keyword argument, so you can compute all the quantiles with one command:
quant_levels = np.quantile(A, p, axis=0)
B = (A >= quant_levels)
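As a quick sanity check, here is a minimal sketch comparing the loop from the question with the axis-based version. The function names quantile_mask_loop and quantile_mask_axis are only illustrative (they are not from the original posts), and the seeded random generator is an assumption made for reproducibility.

```python
import numpy as np

def quantile_mask_loop(A, p):
    # Column-by-column version, as in the question
    n = A.shape[1]
    quant_levels = np.zeros(n)
    for j in range(n):
        quant_levels[j] = np.quantile(A[:, j], p)
    return A >= quant_levels

def quantile_mask_axis(A, p):
    # axis=0 reduces over rows, giving one quantile per column: shape (n,)
    quant_levels = np.quantile(A, p, axis=0)
    # The (n,) vector is broadcast against the (m, n) array row by row
    return A >= quant_levels

rng = np.random.default_rng(0)  # seed chosen arbitrarily
A = rng.random((200, 300))
p = 0.3

B_loop = quantile_mask_loop(A, p)
B_axis = quantile_mask_axis(A, p)
print(np.array_equal(B_loop, B_axis))
print(B_axis.shape, B_axis.dtype)
```

Both functions rely on the same broadcasting step at the end; only the way the per-column quantiles are computed differs.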
Upvotes: 3
Reputation: 50279
A simple way to make this code faster is to run it in parallel using Numba. This also greatly reduces the Numpy overhead, which seems to be the bottleneck here.
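import numpy as np
import numba as nb

@nb.njit('(float64[:,:], float64)', parallel=True)
def compute_quantiles(A, p):
    n = A.shape[1]  # number of columns, taken from A rather than from a global
    quant_levels = np.empty(n)
    # Columns are independent, so the loop iterations can run in parallel
    for i in nb.prange(n):
        quant_levels[i] = np.quantile(A[:, i], p)
    return quant_levels

B = A >= compute_quantiles(A, p)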
On my machine (i5-9600KF CPU with 6 cores), this solution takes 0.23 ms, as opposed to 29 ms for the initial Numpy code. This is about 130 times faster.
Note that compiling the function takes a few seconds on my machine. If this is too expensive, an alternative is to write the code in Cython. However, this alternative requires you to implement the np.quantile function yourself (since Cython cannot speed up Numpy function calls).
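To give an idea of what reimplementing the quantile involves, here is a minimal pure-Python sketch that uses linear interpolation between order statistics, which is the default method of np.quantile. The names quantile_linear and column_quantile_mask are hypothetical. This is not Cython code and it is not fast. It only illustrates the logic that would have to be ported, and the results may differ from Numpy in the last few bits.

```python
import math

def quantile_linear(values, p):
    """Quantile of a sequence, using linear interpolation."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    data = sorted(values)
    if not data:
        raise ValueError("values must be non-empty")
    # Fractional position of the quantile in the sorted data
    pos = p * (len(data) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(data) - 1)
    frac = pos - lo
    # Interpolate between the two neighbouring order statistics
    return data[lo] + (data[hi] - data[lo]) * frac

def column_quantile_mask(rows, p):
    """Boolean mask: True where an entry is >= the p-quantile of its column."""
    n_cols = len(rows[0])
    levels = [quantile_linear([row[j] for row in rows], p) for j in range(n_cols)]
    return [[row[j] >= levels[j] for j in range(n_cols)] for row in rows]

mask = column_quantile_mask([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], 0.5)
print(mask)
```

A full sort is used here for simplicity. A partial selection would be enough to find the two neighbouring order statistics, at the cost of more code.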
A faster solution would be to write a SIMD-friendly quantile implementation (e.g. based on a bitonic sort). However, this is not really possible to do (easily) in Numba, so one would almost certainly need to implement it in a native language that supports SIMD operations. Note that doing this efficiently is not a simple task, even for skilled developers.
Upvotes: 3