Reputation: 48
I am trying to find the Euclidean distance between the elements of two data sets, each with millions of elements. After calculating the Euclidean distances, I need the closest match for each element. Given the number of elements, it will take days to finish.
Below is the code I am trying. I also tried using distance from scipy.spatial, but even that is taking forever.
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# full pairwise distance matrix: one row per df1 point, one column per df2 point
df = pd.DataFrame(euclidean_distances(df1, df2))
df.index = df1.index
df.columns = df2.index

# find each row's closest df2 point before appending the new columns,
# so the reductions only see the distance columns
min_distance_id = df.idxmin(axis=1)
df['min_distance'] = df.min(axis=1)
df['min_distance_id'] = min_distance_id
Is there any other way to get the output in less time?
Upvotes: 1
Views: 2178
Reputation: 42139
I wrote this solution for 2D point lists using numpy. It quickly finds the closest pair of points between two arrays of points. I tried it with two lists of 10 million points each and got the answer in about 4 minutes. With 2 million points on each side, it only took 42 seconds. I don't know if that will be good enough for your needs, but it is definitely faster than "days". It also gives good performance in higher dimensions if you need that as well.
import numpy as np
from math import sqrt

def closest(A,B):

    def bruteForce(A,B):
        d = None
        # iterate over the smaller of the two lists
        swap = A.shape[0] > B.shape[0]
        if swap: A,B = B,A
        for pA in A:
            daB  = np.sum((pA-B)**2,axis=1)
            iMin = np.argmin(daB)
            if d is None or daB[iMin] < d:
                a,b = pA,B[iMin]
                d   = sum((a-b)**2)
        if swap: a,b = b,a
        return a,b,sqrt(d)

    # small sizes are faster using brute force
    if A.shape[0] * B.shape[0] < 1000000 \
    or A.shape[0] < 20 or B.shape[0] < 20:
        return bruteForce(A,B)

    # find center position
    midA  = np.sum(A,axis=0)/A.shape[0]
    midB  = np.sum(B,axis=0)/B.shape[0]
    midAB = (midA+midB)/2

    # closest A to center position
    A2midAB = np.sum((A-midAB)**2,axis=1)
    iA = np.argmin(A2midAB)
    pA = A[iA]

    # closest B to pA
    B2pA = np.sum((B-pA)**2,axis=1)
    iB  = np.argmin(B2pA)
    pB  = B[iB]
    dAB = sqrt(sum((pA-pB)**2))

    # distance of zero is best solution, return immediately
    if dAB == 0: return pA,pB,dAB

    # slope of pA-pB segment
    if pA[0] == pB[0]: p,m = 0,1
    else:              p,m = 1,(pB[1]-pA[1])/(pB[0]-pA[0])

    # perpendicular line intersections with x axis from each point
    xA = m*A[:,1] + p*A[:,0]
    xB = m*B[:,1] + p*B[:,0]

    # baselines for pA and pB
    baseA = xA[iA]
    baseB = xB[iB]
    rightSide = (baseB > baseA)

    # partitions
    ArightOfA = (xA > baseA) == rightSide
    BrightOfA = (xB > baseA) == rightSide
    AleftOfB  = (xA > baseB) != rightSide
    BleftOfB  = (xB > baseB) != rightSide

    # include pB and exclude pA (we already know its closest point in B)
    ArightOfA[iA] = False
    AleftOfB[iA]  = False
    BleftOfB[iB]  = True
    BrightOfA[iB] = True

    # recurse on left side
    if np.any(AleftOfB) and np.any(BleftOfB):
        lA,lB,lD = closest(A[AleftOfB],B[BleftOfB])
        if lD < dAB: pA,pB,dAB = lA,lB,lD

    # recurse on right side
    if np.any(ArightOfA) and np.any(BrightOfA):
        rA,rB,rD = closest(A[ArightOfA],B[BrightOfA])
        if rD < dAB: pA,pB,dAB = rA,rB,rD

    return pA,pB,dAB
Tested using two random sets of 2D points with 10 million points each:
dimCount = 2
ACount   = 10000000
ASpread  = ACount
BCount   = ACount-1
BSpread  = BCount

# points uniformly spread over a range of +/- Spread/2 in each dimension
A = np.random.random((ACount,dimCount))*ASpread-ASpread/2
B = np.random.random((BCount,dimCount))*BSpread-BSpread/2

a,b,d = closest(A,B)
print("closest points:",a,b,"distance:",d)

# closest points: [-4422004.2963273 2783038.35968559] [-4422004.76974851 2783038.61468366] distance: 0.5377282447465505
The way it works is by dividing the A and B points based on a strategically selected pair (pA,pB). The segment between pA and pB serves as a partition for the points of the two lists. Each side of this partition is then processed recursively to find other (closer) pairs of points.
Graphically, this corresponds to a partition based on lines perpendicular to the pA-pB segment.
The strategy for selecting pA and pB is to find the approximate center of the two groups of points and pick a point (pA) from list A that is close to that center. Then select the closest point to pA in list B. This ensures that there are no points in between the two perpendicular lines that are closer to pA or pB in the other list.
Points of A and B that are on opposite sides of the perpendicular lines are necessarily farther away from each other than pA-pB so they can be isolated in two sub-lists and processed separately.
This allows a "divide and conquer" approach that greatly reduces the number of point-to-point distances to compare.
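For intuition, the xA/xB values in the code are (up to a constant factor) the projections of each point onto the pA-to-pB direction, so comparing them against the two baselines tells you which side of each perpendicular line a point falls on. A tiny numeric sketch with made-up points, using the same m/p computation as above:

import numpy as np

# made-up pair: pA at the origin, pB at (3,4)
pA = np.array([0.0, 0.0])
pB = np.array([3.0, 4.0])
p, m = 1, (pB[1]-pA[1])/(pB[0]-pA[0])   # same slope terms as in closest()

pts = np.array([[ 1.0,  1.0],    # projects between the two baselines
                [-2.0, -1.0],    # projects on pA's side
                [ 5.0,  6.0]])   # projects on pB's side
x = m*pts[:,1] + p*pts[:,0]      # scaled projection onto the pA-pB axis
baseA = m*pA[1] + p*pA[0]        # 0.0
baseB = m*pB[1] + p*pB[0]        # 8.333...
print(x, baseA, baseB)           # [ 2.333... -3.333... 13.0 ] 0.0 8.333...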
In my tests (with randomly distributed points), the performance seemed to be linear in proportion to the total number of points in A and B. I tried skewing the distribution by creating small clusters of points far apart (so that no point is actually near the approximate center) and the performance was still linear. I'm not sure if there are any "worst case" point distributions that would cause a drop in performance (I haven't found one yet).
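If you want to verify the result on smaller inputs, you can compare it against an exhaustive pairwise scan. A minimal sketch, assuming the closest function above is in scope (the sizes are chosen so the recursive path is actually exercised, since 2000*1000 pairs exceeds the brute-force threshold):

import numpy as np

np.random.seed(42)
A = np.random.random((2000, 2)) * 1000
B = np.random.random((1000, 2)) * 1000

a, b, d = closest(A, B)

# exhaustive reference: distance of every A point to every B point
allD = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
assert np.isclose(d, allD.min())
print("closest points:", a, b, "distance:", d)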
Upvotes: 1
Reputation: 2004
Did you look at scipy.spatial.cKDTree?
You can construct this data structure from one of your data sets, and query it to get the nearest neighbour (and its distance) for each point of the second data set.
import scipy.spatial

# build the tree from one data set, then query it with the other
KDTree = scipy.spatial.cKDTree(df1)
distances, indexes = KDTree.query(df2, n_jobs=-1)
Here I set n_jobs=-1 to use all available processors.
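To get the same min_distance and min_distance_id columns as in the question, you can map the query results back onto the original indices. A minimal sketch with made-up frames (replace the random data with your own df1 and df2; note that recent SciPy versions name the parallelism parameter workers instead of n_jobs):

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# hypothetical example frames, one 2D point per row
df1 = pd.DataFrame(np.random.random((100000, 2)), columns=['x', 'y'])
df2 = pd.DataFrame(np.random.random((150000, 2)), columns=['x', 'y'])

tree = cKDTree(df1.values)

# for each df2 point: distance to its nearest df1 point,
# and that point's positional index within df1
distances, indexes = tree.query(df2.values, k=1)

df2['min_distance'] = distances
df2['min_distance_id'] = df1.index[indexes]  # map back to df1 labels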
Upvotes: 4