00__00__00

Check if all rows in one array are present in another bigger array

I have a large numpy array

import numpy as np
X = np.random.rand(1000, 1000)

then I have a smaller numpy array, which in this example could be generated by

Y = X[[3, 7, 921], :]

I would like to check programmatically whether all the rows of Y are in X.

The solution should scale well and, if possible, not require additional dependencies.

Inspired by: Get intersecting rows across two 2D numpy arrays

So far I have tried:

np.all([s in set([tuple(X) for x in X]) for s in set([tuple(y) for y in Y])])

but my logic must be flawed, since I still get False when it should be True.


Answers (1)

Divakar

Here's one approach with views -

# https://stackoverflow.com/a/45313353/ @Divakar
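def view1D(a, b):  # a, b are 2D arrays with the same dtype and number of columns
    # Views need contiguous memory, so copy only if the input is not contiguous
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    # One void element spans the bytes of a whole row, so each row becomes a single item
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()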

X1D, Y1D = view1D(X,Y)
out = np.in1d(X1D, Y1D).sum() == len(Y1D)
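
To see what the view does, here is a minimal sketch on a tiny array. The 2x3 example and the name `a1d` are illustrative only and are not part of the answer above. Each row is reinterpreted as one opaque block of bytes, so a 2D array with N rows becomes a 1D array with N elements, and no data is copied if the input is already contiguous.

```python
import numpy as np

# Illustrative 2x3 float64 array: each row occupies 3 * 8 = 24 bytes
a = np.arange(6, dtype=np.float64).reshape(2, 3)

# Same construction as in view1D: one void element per row
void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
a1d = np.ascontiguousarray(a).view(void_dt).ravel()

print(a1d.shape)           # (2,)
print(a1d.dtype.itemsize)  # 24

# The underlying bytes are unchanged; only the interpretation differs
print(a1d.tobytes() == a.tobytes())  # True
```

Because rows are compared as raw bytes, values that are numerically equal but stored differently (for example `0.0` and `-0.0`) would not be treated as the same row.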

Sample run -

In [66]: X = np.random.rand(1000,1000)

In [67]: Y = X[[3,7,921],:]

In [68]: X1D, Y1D = view1D(X,Y)

In [69]: np.in1d(X1D, Y1D).sum() == len(Y1D)
Out[69]: True

In [70]: Y[2] = 1

In [71]: X1D, Y1D = view1D(X,Y)

In [72]: np.in1d(X1D, Y1D).sum() == len(Y1D)
Out[72]: False
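
One caveat on the count-based test: comparing `.sum()` against `len(Y1D)` implicitly assumes that neither array contains repeated rows. The sketch below illustrates this with small, made-up integer arrays standing in for the 1D views (the names `x`, `y`, `count_test` and `all_test` are hypothetical). It uses `np.isin`, which recent NumPy versions recommend over `np.in1d`.

```python
import numpy as np

# Stand-ins for X1D and Y1D; the value 20 appears twice in x
x = np.array([10, 20, 20, 30])
y = np.array([20, 30])

# Count-based test: 3 elements of x match, but len(y) is 2
count_test = np.isin(x, y).sum() == len(y)

# Membership test: asks directly whether every element of y occurs in x
all_test = np.isin(y, x).all()

print(count_test)  # False
print(all_test)    # True
```

With random float data as in the question, duplicate rows are practically impossible, so the count-based test is safe there.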

Benchmarking and alternatives

The proposed way seems pretty fast -

In [73]: %timeit np.in1d(X1D, Y1D).sum() == len(Y1D)
10000 loops, best of 3: 39.6 µs per loop

We can also use np.in1d(Y1D, X1D).all(), but I think this has to search through the elements of X1D for a match to each element of Y1D. In our case X1D, i.e. X, is the larger array, so this should be computationally heavier than np.in1d(X1D, Y1D) as used in the approach proposed earlier -

In [84]: %timeit np.in1d(Y1D, X1D)
100 loops, best of 3: 5.82 ms per loop

In [85]: %timeit np.in1d(X1D, Y1D)
10000 loops, best of 3: 34.1 µs per loop

Hence, the alternative solution would be slower -

In [79]: %timeit np.in1d(Y1D, X1D).all()
100 loops, best of 3: 5.99 ms per loop
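
For completeness, the set-based idea from the question can also be made to work without anything beyond NumPy itself. The following is only a sketch, assuming that the `tuple(X)` in the question's expression was meant to be `tuple(x)`. The helper name `rows_in_set` is hypothetical. It has not been benchmarked here and is likely slower than the view-based approach, since every row is converted to a Python tuple.

```python
import numpy as np

def rows_in_set(X, Y):
    """Return True if every row of Y also occurs as a row of X."""
    # Build the set of X's rows once; tuples of floats are hashable
    x_rows = {tuple(row) for row in X}
    # Check each row of Y against that set
    return all(tuple(row) in x_rows for row in Y)

rng = np.random.default_rng(0)
X = rng.random((1000, 1000))
Y = X[[3, 7, 921], :]
print(rows_in_set(X, Y))      # True

# Overwrite one row so that it no longer occurs in X
Y_mod = Y.copy()
Y_mod[2] = 1
print(rows_in_set(X, Y_mod))  # False
```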
