Reputation: 5327
I have a large numpy
array
X= np.random.rand(1000,1000)
then I have a smaller numpy
array, which in this example could be generated by
Y =X [[3,7,921],:]
I would like to check programmatically whether all the rows of Y
are in X
.
The solution should scale up well. Possibly without requiring additional dependencies.
Inspired by: Get intersecting rows across two 2D numpy arrays
So far I have tried:
np.all([s in set([tuple(X) for x in X]) for s in set([tuple(y) for y in Y])])
but my logic must be flawed since I still get False
when it should be True
Upvotes: 2
Views: 112
Reputation: 221504
Here's one approach with views
-
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
a = np.ascontiguousarray(a)
b = np.ascontiguousarray(b)
void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
return a.view(void_dt).ravel(), b.view(void_dt).ravel()
X1D, Y1D = view1D(X,Y)
out = np.in1d(X1D, Y1D).sum() == len(Y1D)
Sample run -
In [66]: X = np.random.rand(1000,1000)
In [67]: Y = X [[3,7,921],:]
In [68]: X1D, Y1D = view1D(X,Y)
In [69]: np.in1d(X1D, Y1D).sum() == len(Y1D)
Out[69]: True
In [70]: Y[2] = 1
In [71]: X1D, Y1D = view1D(X,Y)
In [72]: np.in1d(X1D, Y1D).sum() == len(Y1D)
Out[72]: False
The proposed way seems pretty fast -
In [73]: %timeit np.in1d(X1D, Y1D).sum() == len(Y1D)
10000 loops, best of 3: 39.6 µs per loop
We can also do - np.in1d(Y1D, X1D).all()
, but I think this would iterate through the elements of X1D
for a match in Y1D
. Now, in our case, it seems X1D
i.e. X
is a larger array, so this would be computationally more heavy than using np.in1d(X1D, Y1D)
as with the earlier proposed one -
In [84]: %timeit np.in1d(Y1D, X1D)
100 loops, best of 3: 5.82 ms per loop
In [85]: %timeit np.in1d(X1D, Y1D)
10000 loops, best of 3: 34.1 µs per loop
Hence, the alternative solution would be slower -
In [79]: %timeit np.in1d(Y1D, X1D).all()
100 loops, best of 3: 5.99 ms per loop
Upvotes: 2