parasu
parasu

Reputation: 973

Efficient numpy argsort with condition while maintaining original indices

I'm wondering what the most efficient way to do an argsort of an array given a condition, while preserving the original index

import numpy as np

x = np.array([0.63, 0.5, 0.7, 0.65])

np.argsort(x)
#Corrected argsort(x) solution
Out[99]: array([1, 0, 3, 2])

I want to argsort this array with condition that x>0.6. Since 0.5 < 0.6, index 1 should not be included.

x = np.array([0.63, 0.5, 0.7, 0.65])
index = x.argsort()
list(filter(lambda i: x[i] > 0.6, index))

[0,3,2]

This is inefficient since its not vectorized.

EDIT: The filter will eliminate most of elements. So ideally, it filter first, then sort, while preserving original index.

Upvotes: 8

Views: 4588

Answers (4)

AGN Gazer
AGN Gazer

Reputation: 8378

Method 1 (Same idea as Tai's method but using integer indexing)

Too late to the party too and if my solution is a repeat of an already posted solution - ping me and I will delete it.

def meth_agn_v1(x, thresh):
    idx = np.arange(x.size)[x > thresh]
    return idx[np.argsort(x[idx])]

Then,

In [143]: meth_agn_v1(x, 0.5)
Out[143]: array([0, 3, 2])

Method 2 (significant performance improvement)

This uses the same idea expressed in the last section of my answer (comparison to Tai's method) that integer indexing is faster than boolean indexing (for small number of expected elements to be selected) and avoids creating an initial index at all.

def meth_agn_v2(x, thresh):
    idx, = np.where(x > thresh)
    return idx[np.argsort(x[idx])]

Timing

In [144]: x = np.random.rand(100000)

In [145]: timeit meth_jp(x, 0.99)
100 loops, best of 3: 7.43 ms per loop

In [146]: timeit meth_alex(x, 0.99)
1000 loops, best of 3: 498 µs per loop

In [147]: timeit meth_tai(x, 0.99)
1000 loops, best of 3: 298 µs per loop

In [148]: timeit meth_agn_v1(x, 0.99)
1000 loops, best of 3: 232 µs per loop

In [161]: timeit meth_agn_v2(x, 0.99)
10000 loops, best of 3: 95 µs per loop

Comparison of v1 to Tai's method

My first version of the answer is very similar to Tai's answer but not identical.

Tai's method as published originally:

def meth_tai(x, thresh):
    y = np.arange(x.shape[0])
    y = y [x > thresh]  
    x = x [x > thresh] # x = x[y] is used in my method
    y[np.argsort(x)]

So, my method is different in using integer array indexing instead of the boolean indexing used by Tai. For a small number of selected elements integer indexing is faster than boolean indexing making this method more efficient than Tai's method even after Tai optimized his code.

Upvotes: 9

Tai
Tai

Reputation: 7994

Come a bit late to the party. The idea is that we can sort an array based on sorted indices of another array.

y = np.arange(x.shape[0]) # y for preserving the indices
mask = x > thresh
y = y[mask]  
x = x[mask]
ans = y[np.argsort(x)]    # change order of y based on sorted indices of x

The method is to add an array y that is just for recording the indices of x. We then filter out both arrays based on the boolean indexes x > thresh. Then, sort x with argsort. Finally, use the indices return by argsort to change the order of y!

Upvotes: 6

kmario23
kmario23

Reputation: 61325

Here is one more hackish approach which modifies the original array with some arbitrary maximum number which will not likely to occur in the original array.

In [50]: x = np.array([0.63, 0.5, 0.7, 0.65])
In [51]: invmask = ~(x > 0.6)

# replace it with some huge number which will not occur in your original array
In [52]: x[invmask] = 9999.0

In [53]: np.argsort(x)[:-sum(invmask)]
Out[53]: array([0, 3, 2])

Upvotes: 0

Alex
Alex

Reputation: 19104

Method 1 (@jp_data_analysis answer)

You should use this one unless you have reason not to.

def meth1(x, thresh):
    return np.argsort(x)[(x <= thresh).sum():]

Method 2

If the filter will greatly reduce the number of elements in the array and the array is large then following may help:

def meth2(x, thresh):
    m = x > thresh
    idxs = np.argsort(x[m])
    offsets = (~m).cumsum()
    return idxs + offsets[m][idxs]

Speed comparison

x = np.random.rand(10000000)

%timeit meth1(x, 0.99)
# 2.81 s ± 244 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit meth2(x, 0.99)
# 104 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 3

Related Questions