JF314

Reputation: 79

Selecting rows/columns in numpy array based on another numpy array (performance)

I have two NumPy arrays. In my case Y contains outputs and P the probability that each output is correct. Both arrays have the shape (outputs, noOfAnswers), and in general outputs is much bigger than noOfAnswers.

I select the two most significant results according to P with:

chooseThem = np.argpartition(P,-2,axis=1)[:,-2:]
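
For illustration only (made-up values, not from the original post): this argpartition call returns, for each row, the column indices of the two largest entries of P. argpartition does not sort them, so the two indices may come back in either order.

import numpy as np

P_small = np.array([[0.1, 0.9, 0.3, 0.7],
                    [0.8, 0.2, 0.5, 0.4]])
top2 = np.argpartition(P_small, -2, axis=1)[:, -2:]
print(top2)  # e.g. [[3 1]
             #       [2 0]]  -- indices of the two largest per row, order not guaranteed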

Now I wish to create a new array YNew of shape (outputs, 2) containing just the values of Y selected by chooseThem. With a for loop this is straightforward, but the performance is not acceptable.

Here is an example of the "bad approach" with some artificial arrays:

import numpy as np

Y = 4 * (np.random.rand(1000, 6) - 0.5)            # outputs
P = np.random.rand(1000, 6)                        # probabilities
biggest2 = np.argpartition(P, -2, axis=1)[:, -2:]  # indices of the two largest P per row
YNew = np.zeros((1000, 2))

# Slow: copy the selected entries one at a time.
for j in range(2):
    for i in range(1000):
        YNew[i, j] = Y[i, biggest2[i, j]]

Does anyone have a suggestion for a fast way to create this new array?

Upvotes: 1

Views: 253

Answers (1)

DJK

Reputation: 9264

This works for indexing the array:

dex = np.array([np.arange(1000), np.arange(1000)]).T  # (1000, 2) array of row indices
YNew = Y[dex, biggest2]                               # fancy indexing: each row index paired with its two column indices
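
As a side note (not part of the original answer), the same lookup can also be written with a broadcast column of row indices, or with np.take_along_axis on NumPy >= 1.15. A minimal sketch, with the setup repeated from the question for completeness:

import numpy as np

Y = 4 * (np.random.rand(1000, 6) - 0.5)
P = np.random.rand(1000, 6)
biggest2 = np.argpartition(P, -2, axis=1)[:, -2:]

rows = np.arange(Y.shape[0])[:, None]   # column of row indices, shape (1000, 1)
YNew_broadcast = Y[rows, biggest2]      # broadcasts against biggest2 to shape (1000, 2)

YNew_take = np.take_along_axis(Y, biggest2, axis=1)  # same per-row lookup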

With some timing (old = loop method, new = index method):
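
The old and new wrappers are not shown in the answer; a plausible sketch of how they might be defined, with the signatures guessed from the %timeit calls below and assuming the setup from the question:

import numpy as np

def old(Y, P, n, biggest2):
    # Loop-based selection from the question (hypothetical wrapper).
    YNew = np.zeros((n, 2))
    for j in range(2):
        for i in range(n):
            YNew[i, j] = Y[i, biggest2[i, j]]
    return YNew

def new(Y, P, n, biggest2):
    # Fancy-indexing selection from this answer (hypothetical wrapper).
    dex = np.array([np.arange(n), np.arange(n)]).T
    return Y[dex, biggest2]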

1000 rows

%timeit new(Y,P,1000,biggest2)
The slowest run took 4.47 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 39.1 µs per loop

%timeit old(Y,P,1000,biggest2)
1000 loops, best of 3: 853 µs per loop

100000 rows

%timeit new(Y,P,100000,biggest2)
100 loops, best of 3: 4.49 ms per loop

%timeit old(Y,P,100000,biggest2)
10 loops, best of 3: 89.4 ms per loop

Upvotes: 1
