Reputation: 153
Consider the following code, which generates a distance matrix from a list of labeled coordinates:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
coord_data = [
[1, 2],
[4, 3],
[5, 8],
[6, 7],
]
df = pd.DataFrame(coord_data, index=list('ABCD'))
dist_matrix = squareform(pdist(df, metric='euclidean'))
dist_df = pd.DataFrame(dist_matrix, index=df.index, columns=df.index)
print(dist_df)
A B C D
A 0.000000 3.162278 7.211103 7.071068
B 3.162278 0.000000 5.099020 4.472136
C 7.211103 5.099020 0.000000 1.414214
D 7.071068 4.472136 1.414214 0.000000
Is there an efficient way (using numpy, pandas, etc.) of getting the N minimum distance pairs from this distance matrix?
For instance, if N=2, an output similar to the following is desired for the given example:
[['C', 'D'], ['A', 'B']] # corresponding to minimum distances [1.414214, 3.162278]
Upvotes: 3
Views: 240
Reputation: 153
Here is another answer using pandas, based on the comment of Laurent R, for the sake of completeness. I ended up using Divakar's solution though.
def topN_index_columns_from_symmmdist2(dist_df, N):
dist_df = pd.melt(dist_df.reset_index(), id_vars="index")
dist_df = dist_df.rename(columns={"index": "start", "variable": "end"})
dist_df = dist_df.sort_values("value")
dist_df = dist_df.drop_duplicates(subset=["value"], keep="last")
dist_pair_list = dist_df.iloc[1:N+1, :2].values.tolist()
return dist_pair_list
Sample outputs:
print(topN_index_columns_from_symmmdist2(dist_df, 2))
print(topN_index_columns_from_symmmdist2(dist_df, 4))
[['C', 'D'], ['A', 'B']]
[['C', 'D'], ['A', 'B'], ['B', 'D'], ['B', 'C']]
Upvotes: 1
Reputation: 221564
Here's one with np.argpartition
for perf. efficiency -
def topN_index_columns_from_symmmdist(df, N):
a = dist_df.to_numpy(copy=True)
a[np.tri(len(a), dtype=bool)] = np.inf
idx = np.argpartition(a.ravel(),range(N))[:N]
r,c = np.unravel_index(idx, a.shape)
return list(zip(dist_df.index[r], dist_df.columns[c]))
Sample runs -
In [43]: dist_df
Out[43]:
A B C D
A 0.000000 3.162278 7.211103 7.071068
B 3.162278 0.000000 5.099020 4.472136
C 7.211103 5.099020 0.000000 1.414214
D 7.071068 4.472136 1.414214 0.000000
In [44]: topN_index_columns_from_symmmdist(df, N=2)
Out[44]: [('C', 'D'), ('A', 'B')]
In [45]: topN_index_columns_from_symmmdist(df, N=4)
Out[45]: [('C', 'D'), ('A', 'B'), ('B', 'D'), ('B', 'C')]
Upvotes: 4