tbs

Reputation: 33

`numpy.nanpercentile` is extremely slow

numpy.nanpercentile is extremely slow, so I wanted to use cupy.nanpercentile instead; but cupy.nanpercentile is not implemented yet. Does someone have a solution for this?
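For reference, here is a minimal timing sketch of the kind of gap I mean (the array size and NaN fraction below are only illustrative):

import time

import numpy as np

# Illustrative only: a 2000x2000 float array with roughly 10% NaNs
rng = np.random.default_rng(0)
arr = rng.random((2000, 2000))
arr[rng.random(arr.shape) < 0.1] = np.nan

t0 = time.perf_counter()
np.percentile(arr, 90, axis=0)     # NaN-unaware; result is mostly NaN, timing only
t1 = time.perf_counter()
np.nanpercentile(arr, 90, axis=0)  # NaN-aware version
t2 = time.perf_counter()
print(f"np.percentile:    {t1 - t0:.3f} s")
print(f"np.nanpercentile: {t2 - t1:.3f} s")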

Upvotes: 1

Views: 1614

Answers (3)

JonasV

Reputation: 1031

Here's an implementation with numba. After it's been compiled it is more than 7x faster than the numpy version.

Right now it is set up to take the percentile along the first axis, but it could easily be changed.

import numba
import numpy as np


@numba.jit(nopython=True, cache=True)
def nan_percentile_axis0(arr, percentiles):
    """Faster implementation of np.nanpercentile.

    This implementation always takes the percentile along axis 0.
    Uses numba to speed up the calculation by more than 7x.

    Function is equivalent to np.nanpercentile(arr, <percentiles>, axis=0)

    Params:
        arr (np.array): Array to calculate percentiles for
        percentiles (np.array): 1D array of percentiles to calculate

    Returns:
        (np.array) Array with first dimension corresponding to
            values as passed in percentiles
    """
    shape = arr.shape
    arr = arr.reshape((arr.shape[0], -1))
    out = np.empty((len(percentiles), arr.shape[1]))
    for i in range(arr.shape[1]):
        out[:, i] = np.nanpercentile(arr[:, i], percentiles)
    shape = (out.shape[0], *shape[1:])
    return out.reshape(shape)
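A quick usage sketch (the shapes, seed, and percentile values are only illustrative) to check it against np.nanpercentile:

import numpy as np

# Small array with some NaNs, percentiles taken along axis 0
rng = np.random.default_rng(0)
data = rng.random((100, 50, 50))
data[rng.random(data.shape) < 0.1] = np.nan

percentiles = np.array([10.0, 50.0, 90.0])
fast = nan_percentile_axis0(data, percentiles)      # first call triggers JIT compilation
slow = np.nanpercentile(data, percentiles, axis=0)
print(np.allclose(fast, slow, equal_nan=True))      # expected to print True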

Upvotes: 0

Stian Rostad

Reputation: 21

I also had a problem with np.nanpercentile being very slow for my datasets. I found a workaround that lets you use the standard np.percentile, and the same idea can be applied to many other libraries.

This one should solve your problem, and it also works a lot faster than np.nanpercentile:

import numpy as np

arr = np.array([[np.nan, 2, 3, 1, 2, 3],
                [np.nan, np.nan, 1, 3, 2, 1],
                [4, 5, 6, 7, np.nan, 9]])

# Mask of valid (non-NaN) entries; comparing against the global minimum is False for NaN
mask = (arr >= np.nanmin(arr)).astype(int)

# Number of valid values per row, and the distinct counts that occur
count = mask.sum(axis=1)
groups = np.unique(count)
groups = groups[groups > 0]

p90 = np.zeros((arr.shape[0]))
for g in range(len(groups)):
    # Rows that share the same number of valid values
    pos = np.where(count == groups[g])
    values = arr[pos]
    # Push NaNs below the minimum so they sort to the front, then drop them
    values = np.nan_to_num(values, nan=(np.nanmin(arr) - 1))
    values = np.sort(values, axis=1)
    values = values[:, -groups[g]:]
    p90[pos] = np.percentile(values, 90, axis=1)

So instead of taking the percentile with the NaNs, it sorts the rows by the amount of valid data, takes the percentile of each group of rows separately, and then puts everything back together. This also works for 3D arrays; just use y_pos and x_pos instead of pos, and watch out for which axis you are calculating over.
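As a quick sanity check (the reference variable below is mine, not part of the snippet above), the result should match np.nanpercentile on the same array:

import numpy as np

reference = np.nanpercentile(arr, 90, axis=1)
print(p90)        # result of the grouping approach above
print(reference)  # NaN-aware reference
print(np.allclose(p90, reference))  # expected to print True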

Upvotes: 2

Mason Ji Ming

Reputation: 101

import random

import cupy as cp
import numpy as np

def testset_gen(num):
    init = []
    for i in range(num):
        a = random.randint(65, 122)  # Dummy name
        b = random.randint(1, 100)   # Dummy value: 11~100 and 10% of nan
        if b < 11:
            b = np.nan  # 10% = nan
        init.append([a, b])
    return np.array(init)

np_testset = testset_gen(30000000)   # 468,751KB
cp_testset = cp.asarray(np_testset)  # GPU copy used by the cupy version below

def f1_np(arr, num):
    return np.nanpercentile(arr[:, 1], num)
# 55.0, 0.523902416229248 sec

print(f1_np(np_testset, 50))

def cupy_nanpercentile(arr, num):
    return len(cp.where(arr > num)[0]) / (len(arr) - cp.sum(cp.isnan(arr))) * 100
    # 55.548758317136446, 0.3640251159667969 sec
    # 43% faster
    # If you need the same result, use int(), but then you lose the saved time.

print(cupy_nanpercentile(cp_testset[:, 1], 50))

I can't imagine how a test could take a few days. On my computer that would mean a trillion rows of data or more, so I can't reproduce the same problem due to lack of resources.
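Note that the function above returns the share of values greater than num (a percentile rank), not the value at the num-th percentile. If you need the actual percentile value on the GPU, a minimal sketch (the mask-and-filter approach here is my own assumption, not part of this answer) could look like this:

import cupy as cp

def gpu_nanpercentile_1d(col, q):
    """Percentile of the non-NaN entries of a 1D cupy array (illustrative sketch)."""
    valid = col[~cp.isnan(col)]     # drop NaNs with a boolean mask
    return cp.percentile(valid, q)  # GPU percentile on the remaining values

# e.g. gpu_nanpercentile_1d(cp_testset[:, 1], 50) should match np.nanpercentile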

Upvotes: 0
