Marcelo Villa
Marcelo Villa

Reputation: 1131

Is there a vectorized way to check whether the ith element of a 1D array is present on the ith element of a 3D array?

I have a 1D array of length k with some arbitrary values and a 3D array of dimensions k * i * j with some data.

import numpy as np

# create 1D and 3D array
values = np.array([2, 5, 1], dtype=np.int)
arr = np.zeros((3, 4, 4), dtype=np.int)

# insert some random numbers in the 3D array
arr[0, 3, 2] = 5
arr[1, 1, 1] = 2
arr[2, 2, 3] = 1
>>> print(values)
[2 5 1]

>>> print(arr)
[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 5 0]]

 [[0 0 0 0]
  [0 2 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 1]
  [0 0 0 0]]]

My goal is to determine if the ith element of values (i.e. a scalar) is present in the ith element of arr (i.e. a 2D array) and get a boolean array of length k.

In my example, I would expect to get an array [False, False, True] as 1 is the only number present in its correspondent 2D array (arr[2]).

As np.isin function is not an option, I have come up with two possible solutions so far.

1) Create a 3D array by repeating the numbers in values and then do an elementh-wise comparison:

rep = np.ones(arr.shape) * values.reshape(-1, 1, 1)
>>> print(rep)
[[[2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]]

 [[5. 5. 5. 5.]
  [5. 5. 5. 5.]
  [5. 5. 5. 5.]
  [5. 5. 5. 5.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]

>>> np.any((arr == rep), axis=(1, 2))
array([False, False,  True])

However, this approach seems like a bad idea from a memory perspective if both values and arr have bigger shapes.

2) Iterate over each value in values and check if it is present in its correspondent 2D array of arr.

result = []
for i, value in enumerate(values):
    result.append(value in arr[i])
>>> print(result)
[False, False, True]

This approach is of course better from a memory perspective but again, when implemented with bigger arrays it can become time consuming (think of k being 1000000 instead of 3).

Is there any other numpy function I am missing or perhaps a better approach to accomplish my goal here?

I already took a look at the answers to a similar question but they do not fit my use case.

Upvotes: 1

Views: 154

Answers (4)

max9111
max9111

Reputation: 6482

As hpaulj already mentioned numba can be an option here.

Example

import numpy as np
import numba as nb

#Turn off parallelization for tiny problems
@nb.njit(parallel=True)
def example(values,arr):
    #Make sure that the first dimension is the same
    assert arr.shape[0]==values.shape[0]
    out=np.empty(values.shape[0],dtype=nb.bool_)

    for i in nb.prange(arr.shape[0]):
        out[i]=False
        for j in range(arr.shape[1]):
            if arr[i,j]==values[i]:
                out[i]=True
                break
    return out

Timings (small arrays)

#your input data
%timeit example(values,arr.reshape(arr.shape[0],-1))# #parallel=True
#10.7 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit example(values,arr.reshape(arr.shape[0],-1))# #parallel=False
#2.15 µs ± 49.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Methods from other answers
%timeit (values[:,None,None]==arr).any(axis=(1,2))
#9.52 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit [(i==a).any() for i,a in zip(values, arr)]
#23.9 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Timings (larger arrays)

values=np.random.randint(low=1,high=100_000,size=1_000_000)
arr=np.random.randint(low=1,high=10_00,size=1_000_000*100).reshape(1_000_000,10,10)

%timeit example(values,arr.reshape(arr.shape[0],-1)) #parallel=True
#48.2 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit example(values,arr.reshape(arr.shape[0],-1)) #parallel=False
#90.5 ms ± 618 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Methods from other answers
%timeit (values[:,None,None]==arr).any(axis=(1,2))
#186 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit [(i==a).any() for i,a in zip(values, arr)]
#6.63 s ± 69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 1

hpaulj
hpaulj

Reputation: 231385

You've basically identified the two options:

In [35]: timeit [(i==a).any() for i,a in zip(values, arr)]                                     
29 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [36]: timeit (values[:,None,None]==arr).any(axis=(1,2))                                     
11.4 µs ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In this small case the big array approach is faster. But for a larger case, the iteration might be better. Memory management with the larger arrays may cancel out the time savings. It's often the case that a few iterations on a complex problem are better than the fully 'vectorized' version.

If it's something you do repeatedly, you could take the time craft a hybrid solution, one that iterates on blocks. But you'd have to judge that yourself.

isin and related code either ors some tests, or using sort of some sort to put like values next to each other for easy comparison.

The other approach is to write a fully iterative solution, and let numba compile it for you.

Upvotes: 1

Sam Mason
Sam Mason

Reputation: 16184

using broadcasting might help:

np.any(values[:,None,None] == arr, axis=(1,2))

is a one liner that gives [False,False,True]. note that if you're storing arr then storing a similar bool array shouldn't be too bad

note that it's the values[:,None,None] == arr that's doing the broasting, strange indexing with None being equivalent to your reshape (but feels more idiomatic to me)

Upvotes: 2

mathfux
mathfux

Reputation: 5949

I have found that your problem is equivalent to

[np.any(arr[i]==values[i]) for i in range(len(values))]

I agree this is time consuming. Elementwise comparison can't be avoided here so np.any(arr[i]==values[i]) or values[i] in arr[i] is must-do here. What about vectorizations, I found it quite difficult to replace list comprehension used here too. This is my way using np.vectorize:

def myfunc(i): return np.any(arr[i]==values[i])
vfunc = np.vectorize(myfunc)
vfunc(np.arange(len(values)))
# output: array([False, False,  True])

Upvotes: 1

Related Questions