Get N minimum distance pairs from pandas dataframe

Question

Consider the following code, which generates a distance matrix from a list of labeled coordinates:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

coord_data = [
    [1, 2],
    [4, 3],
    [5, 8],
    [6, 7],
]

df = pd.DataFrame(coord_data, index=list('ABCD'))

dist_matrix = squareform(pdist(df, metric='euclidean'))
dist_df = pd.DataFrame(dist_matrix, index=df.index, columns=df.index)

print(dist_df)

          A         B         C         D
A  0.000000  3.162278  7.211103  7.071068
B  3.162278  0.000000  5.099020  4.472136
C  7.211103  5.099020  0.000000  1.414214
D  7.071068  4.472136  1.414214  0.000000

Is there an efficient way (using numpy, pandas, etc.) of getting the N minimum distance pairs from this distance matrix?

For instance, if N=2, an output similar to the following is desired for the given example:

[['C', 'D'], ['A', 'B']] # corresponding to minimum distances [1.414214, 3.162278]

Divakar · Accepted Answer

Here's one with np.argpartition for perf. efficiency -

def topN_index_columns_from_symmmdist(df, N):
    a = dist_df.to_numpy(copy=True)
    a[np.tri(len(a), dtype=bool)] = np.inf
    idx = np.argpartition(a.ravel(),range(N))[:N]
    r,c = np.unravel_index(idx, a.shape)
    return list(zip(dist_df.index[r], dist_df.columns[c]))

Sample runs -

In [43]: dist_df
Out[43]: 
          A         B         C         D
A  0.000000  3.162278  7.211103  7.071068
B  3.162278  0.000000  5.099020  4.472136
C  7.211103  5.099020  0.000000  1.414214
D  7.071068  4.472136  1.414214  0.000000

In [44]: topN_index_columns_from_symmmdist(df, N=2)
Out[44]: [('C', 'D'), ('A', 'B')]

In [45]: topN_index_columns_from_symmmdist(df, N=4)
Out[45]: [('C', 'D'), ('A', 'B'), ('B', 'D'), ('B', 'C')]

Get N minimum distance pairs from pandas dataframe

Answers (2)

Related Questions