Reputation: 113905
The actual problem is in some machine learning application, and the data gets a little complex. So here's an MWE that captures the essence of the problem:
I have two arrays made as follows:
L = np.arange(12).reshape(4,3)
M = np.arange(12).reshape(6,2)
Now, I want to find the rows R in L
, such that there exists some row in M
that is made up of all the elements in R except the last one.
From the above example code, L
and M
look like this:
array([[ 0, 1, 2], # L
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
array([[ 0, 1], # M
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])
I would like from these, the marked rows in L
, as a numpy array:
array([[ 0, 1, 2],
[ 6, 7, 8]])
If I were representing L
and M
as python lists, I would have done this:
L = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
M = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
answer = [R for R in L if R[:-1] in M]
Now, I know that I can use a similar list comprehension in numpy and cast the result into an array, numpy being as awesome as it is, probably has a more elegant way to do this that I don't know about.
I tried looking into np.where
(to get the required indices, which I can then subscipt L with), but that doesn't seem to do what I need.
I'd appreciate any help
Upvotes: 3
Views: 1358
Reputation: 23743
>>> print np.array([row for row in L if row[:-1] in M])
[[0 1 2]
[6 7 8]]
Upvotes: 0
Reputation: 1931
Quite similar to Bitwise's answer :
def fn(a):
return lambda b: np.all(a==b, axis=1)
matches = np.apply_along_axis(fn(M), 1, L[:,:2])
result = L[np.any(matches, axis=1)]
What happens under the hood is something like this (I'll use Bitwise's example, which is easier to demonstrate) :
>>> M
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11]])
>>> M.shape+=(1,)
>>> M
array([[[ 0],
[ 1]],
[[ 2],
[ 3]],
[[ 4],
[ 5]],
[[ 6],
[ 7]],
[[ 8],
[ 9]],
[[10],
[11]]])
Here we have added another dimension to the M array, which is now (6,2,1).
>>> L2 = L[:,:-1].T
Then we get rid of the last column of 2, and transpose the array, so that the dimension is (2,4)
And here is the magic, M and L2 are now broadcastable to arrays of dimension (6,2,4).
As numpy's doc states :
A set of arrays is called “broadcastable” to the same shape if the above rules produce a valid result, i.e., one of the following is true:
The arrays all have exactly the same shape. The arrays all have the same number of dimensions and the length of each dimensions is either a common length or 1. The arrays that have too few dimensions can have their shapes prepended with a dimension of length 1 to satisfy property 2.
Example
If a.shape is (5,1), b.shape is (1,6), c.shape is (6,) and d.shape is () so that d is a scalar, then a, b, c, and d are all broadcastable to dimension (5,6); and
a acts like a (5,6) array where a[:,0] is broadcast to the other columns, b acts like a (5,6) array where b[0,:] is broadcast to the other rows, c acts like a (1,6) array and therefore like a (5,6) array where c[:] is broadcast to every row, and finally, d acts like a (5,6) array where the single value is repeated.
M[:,:,0] will be repeated 4 times to fill the 3 dim, and L2 will be prepended a new dimension and be repeated 6 times to fill it.
>>> B = np.broadcast_arrays(L2,M)
>>> B
[array([[[ 0, 3, 6, 9],
[ 1, 4, 7, 10]],
[[ 0, 3, 6, 9],
[ 1, 4, 7, 10]],
[[ 0, 3, 6, 9],
[ 1, 4, 7, 10]],
[[ 0, 3, 6, 9],
[ 1, 4, 7, 10]],
[[ 0, 3, 6, 9],
[ 1, 4, 7, 10]],
[[ 0, 3, 6, 9],
[ 1, 4, 7, 10]]]),
array([[[ 0, 0, 0, 0],
[ 1, 1, 1, 1]],
[[ 2, 2, 2, 2],
[ 3, 3, 3, 3]],
[[ 4, 4, 4, 4],
[ 5, 5, 5, 5]],
[[ 6, 6, 6, 6],
[ 7, 7, 7, 7]],
[[ 8, 8, 8, 8],
[ 9, 9, 9, 9]],
[[10, 10, 10, 10],
[11, 11, 11, 11]]])]
We can now compare them element-wise :
>>> np.equal(*B)
array([[[ True, False, False, False],
[ True, False, False, False]],
[[False, False, False, False],
[False, False, False, False]],
[[False, False, False, False],
[False, False, False, False]],
[[False, False, True, False],
[False, False, True, False]],
[[False, False, False, False],
[False, False, False, False]],
[[False, False, False, False],
[False, False, False, False]]], dtype=bool)
Row to row (axis = 1):
>>> np.all(np.equal(*B), axis=1)
array([[ True, False, False, False],
[False, False, False, False],
[False, False, False, False],
[False, False, True, False],
[False, False, False, False],
[False, False, False, False]], dtype=bool)
Aggregate on L's :
>>> C = np.any(np.all(np.equal(*B), axis=1), axis=0)
>>> C
array([ True, False, True, False], dtype=bool)
And this gives you the boolean mask to apply to L.
>>> L[C]
array([[0, 1, 2],
[6, 7, 8]])
apply_along_axis
will leverage the same feature, but reducing L's dimension instead of increasing M's (thus adding implicit loops).
Upvotes: 3
Reputation: 7807
Ok, I think I got it. The trick is to add another dimension to M
, and then you can use broadcasting:
M.shape += (1,)
E = np.all(L[:,:-1].T == M, 1)
and you get a 6x4 boolean matrix E
that gives you the results of comparing all rows of L with all rows of M.
From here it is easy to finish:
result = L[np.any(E,0)]
This way the solution is streamlined and you don't need any lambda functions or "implicit loops" (e.g. np.apply_along_axis()
).
Yes, numpy vectorization is beautiful (but you have to think quite abstract sometimes)...
Upvotes: 4
Reputation: 77951
>>> import hashlib
>>> fn = lambda xs: hashlib.sha1(xs).hexdigest()
>>> m = np.apply_along_axis(fn, 1, M)
>>> l = np.apply_along_axis(fn, 1, L[:,:-1])
>>> L[np.in1d(l, m)]
array([[0, 1, 2],
[6, 7, 8]])
Upvotes: 0