Reputation: 1131
I have a 1D array of length k with some arbitrary values and a 3D array of dimensions k * i * j with some data.
import numpy as np
# create 1D and 3D array
values = np.array([2, 5, 1], dtype=int)
arr = np.zeros((3, 4, 4), dtype=int)
# insert some random numbers in the 3D array
arr[0, 3, 2] = 5
arr[1, 1, 1] = 2
arr[2, 2, 3] = 1
>>> print(values)
[2 5 1]
>>> print(arr)
[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 5 0]]

 [[0 0 0 0]
  [0 2 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 1]
  [0 0 0 0]]]
My goal is to determine whether the ith element of values (i.e. a scalar) is present in the ith element of arr (i.e. a 2D array), and to get a boolean array of length k. In my example, I would expect to get the array [False, False, True], as 1 is the only number present in its corresponding 2D array (arr[2]). As np.isin is not an option, I have come up with two possible solutions so far.
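For reference, np.isin tests each element of values against the whole of arr rather than against the matching 2D slice, which is why it cannot express this per-slice pairing:
>>> np.isin(values, arr)
array([ True,  True,  True])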
1) Create a 3D array by repeating the numbers in values and then do an element-wise comparison:
rep = np.ones(arr.shape) * values.reshape(-1, 1, 1)
>>> print(rep)
[[[2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]]

 [[5. 5. 5. 5.]
  [5. 5. 5. 5.]
  [5. 5. 5. 5.]
  [5. 5. 5. 5.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]
>>> np.any((arr == rep), axis=(1, 2))
array([False, False, True])
However, this approach seems like a bad idea from a memory perspective if both values and arr have bigger shapes.
2) Iterate over each value in values and check whether it is present in its corresponding 2D array of arr:
result = []
for i, value in enumerate(values):
    result.append(value in arr[i])
>>> print(result)
[False, False, True]
This approach is of course better from a memory perspective, but again it can become time consuming when implemented with bigger arrays (think of k being 1000000 instead of 3).
Is there any other numpy function I am missing, or perhaps a better approach to accomplish my goal here? I already took a look at the answers to a similar question, but they do not fit my use case.
Upvotes: 1
Views: 154
Reputation: 6482
As hpaulj already mentioned, numba can be an option here.
Example
import numpy as np
import numba as nb

# Turn off parallelization (parallel=False) for tiny problems
@nb.njit(parallel=True)
def example(values, arr):
    # Make sure that the first dimensions match
    assert arr.shape[0] == values.shape[0]

    out = np.empty(values.shape[0], dtype=nb.bool_)
    for i in nb.prange(arr.shape[0]):
        out[i] = False
        for j in range(arr.shape[1]):
            if arr[i, j] == values[i]:
                out[i] = True
                break
    return out
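The first call to an njit function triggers the JIT compilation, so warm it up once before timing. A quick check against the question's data (flattening arr to 2D, as in the timings below):
>>> example(values, arr.reshape(arr.shape[0], -1))
array([False, False,  True])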
Timings (small arrays)
#your input data
%timeit example(values, arr.reshape(arr.shape[0], -1))  # parallel=True
#10.7 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit example(values, arr.reshape(arr.shape[0], -1))  # parallel=False
#2.15 µs ± 49.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#Methods from other answers
%timeit (values[:,None,None]==arr).any(axis=(1,2))
#9.52 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit [(i==a).any() for i,a in zip(values, arr)]
#23.9 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Timings (larger arrays)
values = np.random.randint(low=1, high=100_000, size=1_000_000)
arr = np.random.randint(low=1, high=1_000, size=1_000_000*100).reshape(1_000_000, 10, 10)

%timeit example(values, arr.reshape(arr.shape[0], -1))  # parallel=True
#48.2 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit example(values, arr.reshape(arr.shape[0], -1))  # parallel=False
#90.5 ms ± 618 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Methods from other answers
%timeit (values[:,None,None]==arr).any(axis=(1,2))
#186 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit [(i==a).any() for i,a in zip(values, arr)]
#6.63 s ± 69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 1
Reputation: 231385
You've basically identified the two options:
In [35]: timeit [(i==a).any() for i,a in zip(values, arr)]
29 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [36]: timeit (values[:,None,None]==arr).any(axis=(1,2))
11.4 µs ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In this small case the big array approach is faster. But for a larger case, the iteration might be better. Memory management with the larger arrays may cancel out the time savings. It's often the case that a few iterations on a complex problem are better than the fully 'vectorized' version.
If it's something you do repeatedly, you could take the time to craft a hybrid solution, one that iterates on blocks. But you'd have to judge that yourself.
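A minimal sketch of such a blocked hybrid, assuming a hand-picked block size (the helper name any_match_blocked and the default block of 10_000 are illustrative, not from this answer):
import numpy as np

def any_match_blocked(values, arr, block=10_000):
    # flatten each 2D slice so one broadcasted comparison covers it
    k = values.shape[0]
    flat = arr.reshape(k, -1)
    out = np.empty(k, dtype=bool)
    for start in range(0, k, block):
        stop = min(start + block, k)
        # the boolean temporary is only block x (i*j), not k x (i*j)
        out[start:stop] = (flat[start:stop] == values[start:stop, None]).any(axis=1)
    return out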
isin and related code either ORs together a set of elementwise tests, or uses some kind of sort to put like values next to each other for easy comparison.
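One way to picture the sort-based idea, sketched here for a single row (an illustration only, not the actual isin internals):
row = np.sort(arr[2].ravel())          # after sorting, equal values are adjacent
pos = np.searchsorted(row, values[2])  # binary search for the candidate
found = pos < row.size and row[pos] == values[2]
>>> found
True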
The other approach is to write a fully iterative solution, and let numba compile it for you.
Upvotes: 1
Reputation: 16184
Using broadcasting might help:
np.any(values[:,None,None] == arr, axis=(1,2))
is a one-liner that gives [False, False, True]. Note that if you're storing arr, then storing a similar bool array shouldn't be too bad.
Note that it's the values[:,None,None] == arr that's doing the broadcasting; the indexing with None is equivalent to your reshape (but feels more idiomatic to me).
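A quick check of that equivalence:
>>> values[:, None, None].shape
(3, 1, 1)
>>> values.reshape(-1, 1, 1).shape
(3, 1, 1)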
Upvotes: 2
Reputation: 5949
I have found that your problem is equivalent to
[np.any(arr[i] == values[i]) for i in range(len(values))]
I agree this is time consuming. Element-wise comparison can't be avoided here, so np.any(arr[i] == values[i]) or values[i] in arr[i] is a must. As for vectorization, I found it quite difficult to replace the list comprehension used here. This is my way, using np.vectorize:
def myfunc(i): return np.any(arr[i]==values[i])
vfunc = np.vectorize(myfunc)
vfunc(np.arange(len(values)))
# output: array([False, False, True])
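A side note: np.vectorize is documented as essentially a for loop, so it mainly buys convenience rather than speed, and by default it calls the function on the first element to infer the output dtype. Passing otypes pins the dtype and avoids that extra call:
vfunc = np.vectorize(myfunc, otypes=[bool])
vfunc(np.arange(len(values)))
# array([False, False,  True])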
Upvotes: 1