Reputation: 5635
OS: Windows 10 (x64), Build 1909
Python Version: 3.8.10
Numpy Version: 1.21.2
Given two 2D (N, 3) NumPy arrays of (x, y, z) floating-point data points, what is the Pythonic (vectorized) way to find the indices in one array where points are equal to the points in the other array?
(NOTE: My question differs in that I need this to work with real-world data sets where the two data sets may differ by floating point error. Please read on below for details.)
Very similar questions have been asked many times:
SO Post 1 provides a working list comprehension solution, but I am looking for a solution that will scale well to large data sets (i.e. millions of points):
Code 1:
import numpy as np

if __name__ == "__main__":
    big_array = np.array(
        [
            [1.0, 2.0, 1.2],
            [5.0, 3.0, 0.12],
            [-1.0, 14.0, 0.0],
            [-9.0, 0.0, 13.0],
        ]
    )
    small_array = np.array(
        [
            [5.0, 3.0, 0.12],
            [-9.0, 0.0, 13.0],
        ]
    )
    inds = [
        ndx
        for ndx, barr in enumerate(big_array)
        for sarr in small_array
        if all(sarr == barr)
    ]
    print(inds)
Output 1:
[1, 3]
Attempting the solution of SO Post 3 (similar to SO Post 2) with floats does not work (and I suspect something using np.isclose will be needed):
Code 3:
import numpy as np

if __name__ == "__main__":
    big_array = np.array(
        [
            [1.0, 2.0, 1.2],
            [5.0, 3.0, 0.12],
            [-1.0, 14.0, 0.0],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    small_array = np.array(
        [
            [5.0, 3.0, 0.12],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    inds = np.nonzero(
        np.in1d(big_array.view("f,f").reshape(-1), small_array.view("f,f").reshape(-1))
    )[0]
    print(inds)
Output 3:
[ 3 4 5 8 9 10 11]
I have tried numpy.isin with np.all and np.argwhere:

inds = np.argwhere(np.all(np.isin(big_array, small_array), axis=1)).reshape(-1)

which works (and, I argue, is much more readable and understandable, i.e. Pythonic), but will not work for real-world data sets containing floating-point error:
import numpy as np
if __name__ == "__main__":
    big_array = np.array(
        [
            [1.0, 2.0, 1.2],
            [5.0, 3.0, 0.12],
            [-1.0, 14.0, 0.0],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    small_array = np.array(
        [
            [5.0, 3.0, 0.12],
            [-9.0, 0.0, 13.0],
        ],
        dtype=float,
    )
    small_array_fpe = np.array(
        [
            [5.0 + 1e-9, 3.0 + 1e-9, 0.12 + 1e-9],
            [-9.0 + 1e-9, 0.0 + 1e-9, 13.0 + 1e-9],
        ],
        dtype=float,
    )
    inds_no_fpe = np.argwhere(np.all(np.isin(big_array, small_array), axis=1)).reshape(-1)
    inds_with_fpe = np.argwhere(
        np.all(np.isin(big_array, small_array_fpe), axis=1)
    ).reshape(-1)
    print(f"No Floating Point Error: {inds_no_fpe}")
    print(f"With Floating Point Error: {inds_with_fpe}")
    print(f"Are 5.0 and 5.0+1e-9 close?: {np.isclose(5.0, 5.0 + 1e-9)}")
Output:
No Floating Point Error: [1 3]
With Floating Point Error: []
Are 5.0 and 5.0+1e-9 close?: True
How can I make my above solution work (on data sets with floating-point error) by incorporating np.isclose? Alternative solutions are welcome.

NOTE: Since small_array is a subset of big_array, using np.isclose directly doesn't work because the shapes won't broadcast:

np.isclose(big_array, small_array_fpe)

yields

ValueError: operands could not be broadcast together with shapes (4,3) (2,3)
Currently, the only working solution I have is
inds_with_fpe = [
    ndx
    for ndx, barr in enumerate(big_array)
    for sarr in small_array_fpe
    if np.all(np.isclose(sarr, barr))
]
Upvotes: 0
Views: 1397
Reputation: 6482
As @Michael Anderson already mentioned, this can be implemented using a k-d tree. In comparison to your answer, this solution uses an absolute error. Whether or not this is acceptable depends on the problem.
Example
import numpy as np
from scipy import spatial

def find_nearest(big_array, small_array, tolerance):
    tree_big = spatial.cKDTree(big_array)
    tree_small = spatial.cKDTree(small_array)
    return tree_small.query_ball_tree(tree_big, r=tolerance)
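For reference, a minimal usage sketch (assuming the big_array and small_array_fpe arrays defined in the question; the tolerance of 1e-6 is an arbitrary illustrative choice). query_ball_tree returns, for each row of small_array, a list of the row indices in big_array within distance r:

matches = find_nearest(big_array, small_array_fpe, tolerance=1e-6)
# matches[i] holds the row indices in big_array within 1e-6 of small_array_fpe[i]
inds_with_fpe = sorted(ndx for hits in matches for ndx in hits)
print(inds_with_fpe)  # [1, 3] for the arrays in the question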
Timings
big_array = np.random.rand(100_000, 3)
small_array = np.random.rand(1_000, 3)
big_array[1000:2000] = small_array
%timeit find_nearest(big_array, small_array, 1e-9)  # find all pairs within a distance of 1e-9
#55.7 ms ± 830 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#A. Hendry
%timeit np.argwhere(np.isclose(small_array, big_array[:, None, :]).all(-1).any(-1)).reshape(-1)
#3.24 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 2
Reputation: 5635
Credit to @AndrasDeak for this answer
The following code snippet
inds_with_fpe = np.argwhere(
    np.isclose(small_array_fpe, big_array[:, None, :]).all(-1).any(-1)
).reshape(-1)
will make the code work. The corresponding output is now:
No Floating Point Error: [1 3]
With Floating Point Error: [1 3]
Are 5.0 and 5.0+1e-9 close?: True
None in the above creates a new axis (same as np.newaxis). This changes the shape of big_array to (4, 1, 3), which adheres to broadcasting rules and permits np.isclose to run. That is, big_array is now treated as a set of 4 points, each of shape 1 x 3, and since the middle axis of big_array has length 1, small_array_fpe (shape (2, 3), treated as (1, 2, 3) under broadcasting) can be stretched along it so the elements can be compared element-wise.

The result is a (4, 2, 3) boolean array; every point of big_array is compared element-wise to every point of small_array_fpe, and the components where they are close (within the given tolerance) are marked True. Since all is called as an array method rather than as a numpy function, its first argument is the axis rather than the input array. Hence, -1 in the calls above means "the last axis of the array".
Reducing with .all(-1) then asks, for each of those 4 x 2 pairings, whether all (x, y, z) components are close, yielding a 4 x 2 boolean array. Reducing that with .any(-1) asks whether each row of big_array is close to any point of small_array_fpe, yielding a length-4 boolean array, and argwhere extracts the indices where it is True.
argwhere returns indices grouped by element, so its shape is normally (number of nonzero items, number of dims of the input array); hence we flatten it into a 1-D array with reshape(-1).
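For concreteness, here is a small sketch (reusing the arrays defined earlier) that prints the intermediate shapes produced by this expression:

mask = np.isclose(small_array_fpe, big_array[:, None, :])
print(big_array[:, None, :].shape)  # (4, 1, 3): None adds the middle axis
print(mask.shape)                   # (4, 2, 3): broadcast element-wise comparison
print(mask.all(-1).shape)           # (4, 2): are all three components close?
print(mask.all(-1).any(-1).shape)   # (4,): is row i close to any small point?
print(np.argwhere(mask.all(-1).any(-1)).reshape(-1))  # [1 3]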
Unfortunately, this requires a quadratic amount of memory w.r.t. the number of points in each array, since we must compare every element of big_array against every element of small_array_fpe. For example, searching for 10,000 points among another 10,000 points of 32-bit floating-point (x, y, z) data needs a temporary (10000, 10000, 3) array, i.e. roughly

Memory = 10000 * 10000 * 3 * 4 bytes ≈ 1.2 GB of RAM

per temporary (and np.isclose allocates several such temporaries)!
If anyone can devise a solution with a faster run time and reasonable amount of memory, that would be fantastic!
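One way to keep the memory bounded while keeping the same broadcasting idea is to compare big_array one chunk at a time. Below is a rough sketch (the chunk size of 1024 is an arbitrary, untuned choice):

import numpy as np

def isclose_indices_chunked(big_array, small_array, chunk=1024):
    # Compare one block of big_array rows at a time so that the intermediate
    # boolean array is only (chunk, len(small_array), 3) instead of quadratic.
    hits = []
    for start in range(0, len(big_array), chunk):
        block = big_array[start:start + chunk]
        mask = np.isclose(small_array, block[:, None, :]).all(-1).any(-1)
        hits.append(np.flatnonzero(mask) + start)
    return np.concatenate(hits) if hits else np.array([], dtype=int)

For the arrays above, isclose_indices_chunked(big_array, small_array_fpe) returns [1 3] while only ever materializing one chunk-sized comparison array at a time.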
FYI:
from timeit import timeit
import numpy as np
big_array = np.array(
    [
        [1.0, 2.0, 1.2],
        [5.0, 3.0, 0.12],
        [-1.0, 14.0, 0.0],
        [-9.0, 0.0, 13.0],
    ],
    dtype=float,
)

small_array = np.array(
    [
        [5.0 + 1e-9, 3.0 + 1e-9, 0.12 + 1e-9],
        [10.0, 2.0, 5.8],
        [-9.0 + 1e-9, 0.0 + 1e-9, 13.0 + 1e-9],
    ],
    dtype=float,
)

def approach01():
    return [
        ndx
        for ndx, barr in enumerate(big_array)
        for sarr in small_array
        if np.all(np.isclose(sarr, barr))
    ]

def approach02():
    return np.argwhere(
        np.isclose(small_array, big_array[:, None, :]).all(-1).any(-1)
    ).reshape(-1)

if __name__ == "__main__":
    time01 = timeit(
        "approach01()",
        number=10000,
        setup="import numpy as np; from __main__ import approach01",
    )
    time02 = timeit(
        "approach02()",
        number=10000,
        setup="import numpy as np; from __main__ import approach02",
    )
    print(f"Approach 1 (List Comprehension): {time01}")
    print(f"Approach 2 (Vectorized): {time02}")
Output:
Approach 1 (List Comprehension): 8.1180582
Approach 2 (Vectorized): 0.9656997
Upvotes: 0
Reputation: 73490
I'm not going to give any code, but I've dealt with problems similar to this on a large scale. I suspect that to get decent performance with either of these approaches you'll need to implement the core in C (you might get away with using numba).
If both your arrays are huge there are a few approaches that can work. Primarily these boil down to building a structure that can be used to find the nearest neighbor of a point from one of the arrays, and then querying it for each point in the other data set.
To do this I've previously used a Kd Tree approach, and a grid based approach.
The basis of the grid-based approach is to divide space into an L x N x M grid of bins, drop each point of the first array into its bin, and then, for each point of the second array, search only its own bin and the immediately neighbouring bins (a rough sketch is given after this description).
The edge cases you need to handle are points that lie within the tolerance of a bin boundary (their match may sit in an adjacent bin) and points that fall outside the grid altogether.
The downside is that this approach performs poorly for data that is not uniformly distributed.
The upside is that it is relatively simple. Expected run time for uniform data is n1 * n2 / (L*N*M) (compared to n1 * n2). Typically you select L, N, and M such that this becomes O(n log(n)). You also get some further uplift from sorting the second array to improve reuse of the bins. It is also relatively easy to parallelize (both the binning and the searching).
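This is not the answerer's code, but a rough sketch of the grid/binning idea described above (the uniform cell size equal to the tolerance and the dictionary-of-bins layout are illustrative choices):

import numpy as np
from collections import defaultdict

def grid_match(big_array, small_array, tol):
    # Bin the rows of big_array into a uniform grid with cell size >= tol.
    cell = max(tol, 1e-12)
    bins = defaultdict(list)
    for idx, point in enumerate(big_array):
        bins[tuple(np.floor(point / cell).astype(int))].append(idx)
    # For each small point, check only its own bin and the 26 neighbouring
    # bins (this covers the bin-boundary edge case mentioned above).
    matches = set()
    for q in small_array:
        base = np.floor(q / cell).astype(int)
        for offset in np.ndindex(3, 3, 3):
            key = tuple(base + np.array(offset) - 1)
            for idx in bins.get(key, ()):
                if np.all(np.abs(big_array[idx] - q) <= tol):
                    matches.add(idx)
    return sorted(matches)

For the arrays in the question, grid_match(big_array, small_array_fpe, 1e-6) returns [1, 3]; for non-uniformly distributed data the bins fill unevenly, which is the downside noted above.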
The k-d tree approach is similar. IIRC it gives O(n log(n)) behavior, but it is trickier to implement, and building the data structure is tricky to parallelize. It tends not to be as cache friendly, which can mean that although its asymptotic run time is better than the grid-based approach's, it can run slower in practice. However, it does give better guarantees for non-uniformly distributed data.
Upvotes: 2