shmulvad
shmulvad

Reputation: 638

Find all close numerical matches in two 2D arrays

Update: I made the solution into a library called close-numerical-matches.


I am looking for a way to find all close matches (within some tolerance) between two 2D arrays and get an array of the indices of the found matches. Multiple answers on SO show how to solve this problem for exact matches (typically with a dictionary), but that is not what I am looking for. Let me give an example:

>>> arr1 = [
    [19.21, 19.19],
    [13.18, 11.55],
    [21.45,  5.83]
]
>>> arr2 = [
    [13.11, 11.54],
    [19.20, 19.19],
    [51.21, 21.55],
    [19.22, 19.18],
    [11.21, 11.55]
]
>>> find_close_match_indices(arr1, arr2, tol=0.1)
[[0, 1], [0, 3], [1, 0]]

Above, [[0, 1], [0, 3], [1, 0]] is returned because element 0 in arr1, [19.21, 19.19] is within tolerance to elements 1 and 3 in arr2. Order is not important to me, i.e. [[0, 3], [1, 0], [0, 1]] would be just as acceptable.

The shape of arr1 is (n, 2) and arr2 is (m, 2). You can expect that n and m will be huge. Now, I can easily implement this using a nested for loop but I am sure there must be some smarter way than comparing every element against all other elements.

I thought about using k-means clustering to divide the problem into k buckets and thus make the nested for-loop approach more tractable, but I think there may be a small risk two close elements are just at the "border" of each of their clusters and therefore wouldn't get compared.

Any external dependencies such as Numpy, Scipy, etc. are fine and it is fine as well as to use O(n + m) space.

Upvotes: 1

Views: 201

Answers (3)

shmulvad
shmulvad

Reputation: 638

I got an idea for how to use buckets to solve this problem. The idea is that a key is formed based on the values of the elements and the tolerance level. To make sure potential matches that were in the "edge" of the bucket are compared against other element at "edges", all neighbour buckets are compared. Finally, I modified @Tim Roberts' approach for performing the actual matching slightly to match on both columns.

I made this into a library called close-numerical-matches. Sample usage:

>>> import numpy as np
>>> from close_numerical_matches import find_matches
>>> arr0 = np.array([[25, 24], [50, 50], [25, 26]])
>>> arr1 = np.array([[25, 23], [25, 25], [50.6, 50.6], [60, 60]])
>>> find_matches(arr0, arr1, tol=1.0001)
array([[0, 0], [0, 1], [1, 2], [2, 1]])
>>> find_matches(arr0, arr1, tol=0.9999)
array([[1, 2]])
>>> find_matches(arr0, arr1, tol=0.60001)
array([], dtype=int64)
>>> find_matches(arr0, arr1, tol=0.60001, dist='max')
array([[1, 2]])
>>> manhatten_dist = lambda arr: np.sum(np.abs(arr), axis=1)
>>> matches = find_matches(arr0, arr1, tol=0.11, dist=manhatten_dist)
>>> matches
array([[0, 1], [0, 1], [2, 1]])
>>> indices0, indices1 = matches.T
>>> arr0[indices0]
array([[25, 24], [25, 24], [25, 26]])

Some profiling:

from timeit import default_timer as timer
import numpy as np
from close_numerical_matches import naive_find_matches, find_matches

arr0 = np.random.rand(320_000, 2)
arr1 = np.random.rand(44_000, 2)

start = timer()
naive_find_matches(arr0, arr1, tol=0.001)
end = timer()
print(end - start)  # 255.335 s

start = timer()
find_matches(arr0, arr1, tol=0.001)
end = timer()
print(end - start)  # 5.821 s

Upvotes: 0

NNN
NNN

Reputation: 593

Perhaps you might find the following useful. Might be faster than @Tim-Roberts 's solution because there are no explicit for loops. But it will use more storage.

import numpy as np

xarr1 = np.array([
    [19.21, 19.19],
    [13.18, 11.55],
    [21.45,  5.83]
])
xarr2 = np.array([
    [13.11, 11.54],
    [19.20, 19.19],
    [51.21, 21.55],
    [19.22, 19.18],
    [11.21, 11.55]
])

tol=0.1
xarr1=xarr1[:,None,:]
xarr2=xarr2[None,:,:]
# broadcasting
cc = xarr2-xarr1
cc = np.apply_along_axis(np.linalg.norm,-1,cc)
# or you can use other metrics of closeness e.g. as below
#cc = np.apply_along_axis(np.abs,-1,cc) 
#cc = np.apply_along_axis(np.max,-1,cc)
id1,id2=np.where(cc<tol)

Upvotes: 0

Tim Roberts
Tim Roberts

Reputation: 54867

You can't do it with NO loops, but you can do it with ONE loop by taking advantage of the boolean indexing:

import numpy as np

xarr1 = np.array([
    [19.21, 19.19],
    [13.18, 11.55],
    [21.45,  5.83]
])
xarr2 = np.array([
    [13.11, 11.54],
    [19.20, 19.19],
    [51.21, 21.55],
    [19.22, 19.18],
    [11.21, 11.55]
])

def find_close_match_indices(arr1, arr2, tol=0.1):
    results = []
    for i,r1 in enumerate(arr1[:,0]):
        x1 = np.abs(arr2[:,0]-r1) < tol
        results.extend( [i,k] for k in np.where(x1)[0] )
    return results

print(find_close_match_indices(xarr1,xarr2,0.1))

Output:

[[0, 1], [0, 3], [1, 0]]

Upvotes: 1

Related Questions