Enigma Machine
Enigma Machine

Reputation: 153

Get N minimum distance pairs from pandas dataframe

Consider the following code, which generates a distance matrix from a list of labeled coordinates:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

coord_data = [
    [1, 2],
    [4, 3],
    [5, 8],
    [6, 7],
]

df = pd.DataFrame(coord_data, index=list('ABCD'))

dist_matrix = squareform(pdist(df, metric='euclidean'))
dist_df = pd.DataFrame(dist_matrix, index=df.index, columns=df.index)

print(dist_df)
          A         B         C         D
A  0.000000  3.162278  7.211103  7.071068
B  3.162278  0.000000  5.099020  4.472136
C  7.211103  5.099020  0.000000  1.414214
D  7.071068  4.472136  1.414214  0.000000

Is there an efficient way (using numpy, pandas, etc.) of getting the N minimum distance pairs from this distance matrix?

For instance, if N=2, an output similar to the following is desired for the given example:

[['C', 'D'], ['A', 'B']] # corresponding to minimum distances [1.414214, 3.162278]

Upvotes: 3

Views: 240

Answers (2)

Enigma Machine
Enigma Machine

Reputation: 153

Here is another answer using pandas, based on the comment of Laurent R, for the sake of completeness. I ended up using Divakar's solution though.

def topN_index_columns_from_symmmdist2(dist_df, N):
    dist_df = pd.melt(dist_df.reset_index(), id_vars="index")
    dist_df = dist_df.rename(columns={"index": "start", "variable": "end"})
    dist_df = dist_df.sort_values("value")
    dist_df = dist_df.drop_duplicates(subset=["value"], keep="last")
    dist_pair_list = dist_df.iloc[1:N+1, :2].values.tolist()
    return dist_pair_list

Sample outputs:

print(topN_index_columns_from_symmmdist2(dist_df, 2))
print(topN_index_columns_from_symmmdist2(dist_df, 4))
[['C', 'D'], ['A', 'B']]
[['C', 'D'], ['A', 'B'], ['B', 'D'], ['B', 'C']]

Upvotes: 1

Divakar
Divakar

Reputation: 221564

Here's one with np.argpartition for perf. efficiency -

def topN_index_columns_from_symmmdist(df, N):
    a = dist_df.to_numpy(copy=True)
    a[np.tri(len(a), dtype=bool)] = np.inf
    idx = np.argpartition(a.ravel(),range(N))[:N]
    r,c = np.unravel_index(idx, a.shape)
    return list(zip(dist_df.index[r], dist_df.columns[c]))

Sample runs -

In [43]: dist_df
Out[43]: 
          A         B         C         D
A  0.000000  3.162278  7.211103  7.071068
B  3.162278  0.000000  5.099020  4.472136
C  7.211103  5.099020  0.000000  1.414214
D  7.071068  4.472136  1.414214  0.000000

In [44]: topN_index_columns_from_symmmdist(df, N=2)
Out[44]: [('C', 'D'), ('A', 'B')]

In [45]: topN_index_columns_from_symmmdist(df, N=4)
Out[45]: [('C', 'D'), ('A', 'B'), ('B', 'D'), ('B', 'C')]

Upvotes: 4

Related Questions