numpy - select the rows where a column is equal in two arrays

Question

I have two arrays:

a = 
[[ 461.  0.  ]
 [ 480.  15. ]
 [ 463.  28. ]]

and

b = 
[[ 463.  0.  ]
 [ 462.  8.  ]
 [ 466.  15. ]
 [ 469.  22. ]
 [ 470.  28. ]
 [ 473.  34. ]]

I need a resulting array comprised of a minus b only if the second column of a => [0 15 28] is in the second column of b => [0 8 15 22 28 34]. All elements of the second column of a will be in the second column of b, I just want to discard those in b that don't exist in a. The expected result is:

result =
[[  -2.  0.  ]
 [  14.  15. ]
 [  -7.  28. ]]

To begin, I thought of getting the 'subarray' of b that contains just the rows I'm interested in. Among many other things, the one I thought would work (and didn't) was this:

result = b[b[:, 1] in a[:, 1]] # not working

Any help is welcome.

rayryeng · Accepted Answer

This algorithm works under the following assumptions:

The second column of a is a subset of the second column of b. This means that we are guaranteed to find a value in the second column of b given a value in the second column of a.
The second columns of a and b are sorted.
There are no duplicate values in the second column shared between a and b.
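
These assumptions can be checked before relying on the approach. The following is only a minimal sketch: `check_assumptions` is a hypothetical helper name, not part of NumPy, and it tests for strictly increasing second columns, which is slightly stronger than the assumptions require (it rules out duplicates anywhere in the column, not just among the shared values).

```python
import numpy as np

a = np.array([[461, 0], [480, 15], [463, 28]], dtype=float)
b = np.array([[463, 0], [462, 8], [466, 15], [469, 22], [470, 28], [473, 34]], dtype=float)

def check_assumptions(a, b):
    """Hypothetical helper: True if the second columns satisfy the assumptions above."""
    ka, kb = a[:, 1], b[:, 1]
    # Every key of a must appear somewhere in the keys of b.
    is_subset = np.isin(ka, kb).all()
    # Strictly increasing keys are sorted and contain no duplicates.
    is_increasing = np.all(np.diff(ka) > 0) and np.all(np.diff(kb) > 0)
    return bool(is_subset and is_increasing)

print(check_assumptions(a, b))
```

If the check fails, the row-by-row alignment described below is not guaranteed, and the subtraction may pair up the wrong rows or fail with a shape mismatch.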

Use numpy.in1d to test whether each value in the second column of b can be found in the second column of a (in recent NumPy releases, numpy.isin is the recommended replacement for new code). You can then use the resulting Boolean array to index into b and keep only the matching rows. This works because both second columns are sorted: the rows of b selected by the mask come out in the same order as the rows of a, so the second column of the sliced result matches the second column of a value for value. Once you have this alignment, subtract the first column of the sliced result from the first column of a. To finish, take the second column of the sliced rows of b and stack the two columns together:

In [119]: import numpy as np

In [120]: a = np.array([[461,0],[480,15],[463,28]], dtype=float)

In [121]: b = np.array([[463,0], [462,8], [466,15], [469,22], [470,28], [473,34]], dtype=float)

In [122]: ind = np.in1d(b[:,1], a[:,1])

In [123]: np.column_stack([a[:,0]-b[ind,0], b[ind,1]])
Out[123]: 
array([[ -2.,   0.],
       [ 14.,  15.],
       [ -7.,  28.]])

What is returned from numpy.in1d is a Boolean array that tells you whether the ith value in the first input of numpy.in1d can be found anywhere in the second input of numpy.in1d. To see what this looks like, given your data, we get:

In [124]: ind
Out[124]: array([ True, False,  True, False,  True, False], dtype=bool)

As you can see, the first, third and fifth values in the second column of b can be found in a. We simply index into b with this mask to extract those rows, and the second-column values of the sliced result line up exactly with those of a. We then subtract the first column of the sliced result from the first column of a.
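
For reference, the same steps can be written as a plain script. This sketch swaps in `np.isin`, which the NumPy documentation recommends over `np.in1d` for new code; the variable names `mask`, `matched` and `result` are illustrative choices, not from the session above.

```python
import numpy as np

a = np.array([[461, 0], [480, 15], [463, 28]], dtype=float)
b = np.array([[463, 0], [462, 8], [466, 15], [469, 22], [470, 28], [473, 34]], dtype=float)

# True where a value in b's second column also appears in a's second column.
mask = np.isin(b[:, 1], a[:, 1])

# Keep only the matching rows of b; they line up with the rows of a
# because both second columns are sorted.
matched = b[mask]

# First column: a minus the matching rows of b. Second column: the shared key.
result = np.column_stack([a[:, 0] - matched[:, 0], matched[:, 1]])
print(result)
```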

A cleaner approach is to slice into b and extract the matching rows in full instead of just the first column, then overwrite the first column of this intermediate result with the first column of a minus that column:

In [125]: out = b[ind]

In [126]: out[:,0] = a[:,0] - out[:,0]

In [127]: out
Out[127]: 
array([[  -2.,   0.],
       [  14.,  15.],
       [  -7.,  28.]])
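
One detail worth noting about this second version: indexing with a Boolean mask returns a copy rather than a view, so assigning into out does not modify b. A small sketch to confirm this, where `b_before` is just an illustrative name for a saved copy and `np.isin` stands in for `np.in1d`:

```python
import numpy as np

a = np.array([[461, 0], [480, 15], [463, 28]], dtype=float)
b = np.array([[463, 0], [462, 8], [466, 15], [469, 22], [470, 28], [473, 34]], dtype=float)

ind = np.isin(b[:, 1], a[:, 1])
b_before = b.copy()              # snapshot of b for comparison

out = b[ind]                     # Boolean-mask indexing produces a new array
out[:, 0] = a[:, 0] - out[:, 0]  # in-place update touches only out

print(np.array_equal(b, b_before))  # b is left unchanged
print(np.shares_memory(out, b))     # out does not alias b's data
```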
