Reputation: 48
I am trying to find the Euclidean distance between the elements of two data sets, each with millions of elements. After calculating the Euclidean distances, I need the closest match for each element. Given the number of elements, it will take days to finish.
Below is the code I am trying. I also tried using distance from scipy.spatial, but even that is taking forever.
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# full pairwise distance matrix: one row per df1 point, one column per df2 point
df = pd.DataFrame(euclidean_distances(df1, df2))
df.index = df1.index
df.columns = df2.index

# find each row's closest df2 point before appending the new columns,
# so the reductions only see the distance columns
min_distance_id = df.idxmin(axis=1)
df['min_distance'] = df.min(axis=1)
df['min_distance_id'] = min_distance_id
Is there any other way to get the output in less time?
Upvotes: 1
Views: 2178
Reputation: 42139
I wrote this solution for 2D point lists using numpy. It quickly finds the closest pair of points between two arrays of points. I tried it with two lists of 10 million points each and got the answer in about 4 minutes. With 2 million points on each side, it only took 42 seconds. I don't know if that will be good enough for your needs, but it is definitely faster than "days". It also gives good performance in higher dimensions if you need that as well.
import numpy as np
from math import sqrt

def closest(A,B):

    def bruteForce(A,B):
        d = None
        # iterate over the smaller of the two lists
        swap = A.shape[0] > B.shape[0]
        if swap: A,B = B,A
        for pA in A:
            daB  = np.sum((pA-B)**2,axis=1)
            iMin = np.argmin(daB)
            if d is None or daB[iMin] < d:
                a,b = pA,B[iMin]
                d   = sum((a-b)**2)
        if swap: a,b = b,a
        return a,b,sqrt(d)

    # small sizes are faster using brute force
    if A.shape[0] * B.shape[0] < 1000000 \
    or A.shape[0] < 20 or B.shape[0] < 20:
        return bruteForce(A,B)

    # find center position
    midA  = np.sum(A,axis=0)/A.shape[0]
    midB  = np.sum(B,axis=0)/B.shape[0]
    midAB = (midA+midB)/2

    # closest A to center position
    A2midAB = np.sum((A-midAB)**2,axis=1)
    iA = np.argmin(A2midAB)
    pA = A[iA]

    # closest B to pA
    B2pA = np.sum((B-pA)**2,axis=1)
    iB  = np.argmin(B2pA)
    pB  = B[iB]
    dAB = sqrt(sum((pA-pB)**2))

    # distance of zero is best solution, return immediately
    if dAB == 0: return pA,pB,dAB

    # slope of pA-pB segment
    if pA[0] == pB[0]: p,m = 0,1
    else:              p,m = 1,(pB[1]-pA[1])/(pB[0]-pA[0])

    # perpendicular line intersections with x axis from each point
    xA = m*A[:,1] + p*A[:,0]
    xB = m*B[:,1] + p*B[:,0]

    # baselines for pA and pB
    baseA = xA[iA]
    baseB = xB[iB]
    rightSide = (baseB > baseA)

    # partitions
    ArightOfA = (xA > baseA) == rightSide
    BrightOfA = (xB > baseA) == rightSide
    AleftOfB  = (xA > baseB) != rightSide
    BleftOfB  = (xB > baseB) != rightSide

    # include pB and exclude pA (we already know its closest point in B)
    ArightOfA[iA] = False
    AleftOfB[iA]  = False
    BleftOfB[iB]  = True
    BrightOfA[iB] = True

    # recurse on left side
    if np.any(AleftOfB) and np.any(BleftOfB):
        lA,lB,lD = closest(A[AleftOfB],B[BleftOfB])
        if lD < dAB: pA,pB,dAB = lA,lB,lD

    # recurse on right side
    if np.any(ArightOfA) and np.any(BrightOfA):
        rA,rB,rD = closest(A[ArightOfA],B[BrightOfA])
        if rD < dAB: pA,pB,dAB = rA,rB,rD

    return pA,pB,dAB
Tested using two random sets of 2D points with 10 million points each:
dimCount = 2
ACount   = 10000000
ASpread  = ACount
BCount   = ACount-1
BSpread  = BCount

# points uniformly spread over a range of +/- Spread/2 in each dimension
A = np.random.random((ACount,dimCount))*ASpread-ASpread/2
B = np.random.random((BCount,dimCount))*BSpread-BSpread/2

a,b,d = closest(A,B)
print("closest points:",a,b,"distance:",d)

# closest points: [-4422004.2963273 2783038.35968559] [-4422004.76974851 2783038.61468366] distance: 0.5377282447465505
The way it works is by dividing the A and B points based on a strategically selected pair (pA,pB). The segment between pA and pB serves as a partition for the points of the two lists. Each side of this partition is then processed recursively to find other (closer) pairs of points.
Graphically, this corresponds to a partition based on lines perpendicular to the pA-pB segment.
The strategy for selecting pA and pB is to find the approximate center of the two groups of points and pick a point (pA) from list A that is close to that center. Then select the closest point to pA in list B. This ensures that there are no points in between the two perpendicular lines that are closer to pA or pB in the other list.
Points of A and B that are on opposite sides of the perpendicular lines are necessarily farther away from each other than pA-pB so they can be isolated in two sub-lists and processed separately.
This allows a "divide and conquer" approach that greatly reduces the number of point-to-point distances to compare.
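For intuition, the xA/xB values in the code are (up to a constant factor) the projections of each point onto the pA-to-pB direction, so comparing them against the two baselines tells you which side of each perpendicular line a point falls on. A tiny numeric sketch with made-up points, using the same m/p computation as above:

import numpy as np

# made-up pair: pA at the origin, pB at (3,4)
pA = np.array([0.0, 0.0])
pB = np.array([3.0, 4.0])
p, m = 1, (pB[1]-pA[1])/(pB[0]-pA[0])   # same slope terms as in closest()

pts = np.array([[ 1.0,  1.0],    # projects between the two baselines
                [-2.0, -1.0],    # projects on pA's side
                [ 5.0,  6.0]])   # projects on pB's side
x = m*pts[:,1] + p*pts[:,0]      # scaled projection onto the pA-pB axis
baseA = m*pA[1] + p*pA[0]        # 0.0
baseB = m*pB[1] + p*pB[0]        # 8.333...
print(x, baseA, baseB)           # [ 2.333... -3.333... 13.0 ] 0.0 8.333...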
In my tests (with randomly distributed points), the performance seemed to be linear in proportion to the total number of points in A and B. I tried skewing the distribution by creating small clusters of points far apart (so that no point is actually near the approximate center) and the performance was still linear. I'm not sure if there are any "worst case" point distributions that would cause a drop in performance (I haven't found one yet).
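If you want to verify the result on smaller inputs, you can compare it against an exhaustive pairwise scan. A minimal sketch, assuming the closest function above is in scope (the sizes are chosen so the recursive path is actually exercised, since 2000*1000 pairs exceeds the brute-force threshold):

import numpy as np

np.random.seed(42)
A = np.random.random((2000, 2)) * 1000
B = np.random.random((1000, 2)) * 1000

a, b, d = closest(A, B)

# exhaustive reference: distance of every A point to every B point
allD = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
assert np.isclose(d, allD.min())
print("closest points:", a, b, "distance:", d)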
Upvotes: 1
Reputation: 2004
Did you look at scipy.spatial.cKDTree?
You can construct this data structure from one of your data sets, and query it to get the nearest neighbour (and its distance) for each point of the second data set.
import scipy.spatial

# build the tree from one data set, then query it with the other
KDTree = scipy.spatial.cKDTree(df1)
distances, indexes = KDTree.query(df2, n_jobs=-1)
Here I set n_jobs=-1 to use all available processors.
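To get the same min_distance and min_distance_id columns as in the question, you can map the query results back onto the original indices. A minimal sketch with made-up frames (replace the random data with your own df1 and df2; note that recent SciPy versions name the parallelism parameter workers instead of n_jobs):

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# hypothetical example frames, one 2D point per row
df1 = pd.DataFrame(np.random.random((100000, 2)), columns=['x', 'y'])
df2 = pd.DataFrame(np.random.random((150000, 2)), columns=['x', 'y'])

tree = cKDTree(df1.values)

# for each df2 point: distance to its nearest df1 point,
# and that point's positional index within df1
distances, indexes = tree.query(df2.values, k=1)

df2['min_distance'] = distances
df2['min_distance_id'] = df1.index[indexes]  # map back to df1 labels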
Upvotes: 4