Reputation: 41
All the answers I have found deal with the opposite problem: generating a pairwise distance matrix from a list of pairwise distances. Suppose I have a full pairwise distance matrix in this format:
       1      2      3      4      5
1  0.000  1.154  1.235  1.297  0.960
2  1.154  0.000  0.932  0.929  0.988
3  1.235  0.932  0.000  0.727  1.244
4  1.297  0.929  0.727  0.000  1.019
5  0.960  0.988  1.244  1.019  0.000
And in the end I need a pairwise distance list like:
1 2 1.154
1 3 1.235
1 4 1.297
1 5 0.960
2 3 0.932
... ... ...
How would I proceed? I haven't found a function yet that "slices" the dataframe above the diagonal of 0s. It would also be great if you could give me a hint about how well this scales with matrix size, because the original distance matrix is about 100,000 x 100,000 items.
Many, many thanks in advance!
I figured out that I can use the pandas DataFrame.stack() function to get the list. But in that case I get every distance twice, because the matrix is symmetric, and I also get the distances of 0 between identical items.
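For what it's worth, the stack() idea can probably be rescued by blanking out the diagonal and the lower triangle before stacking. The following is only a sketch on the 5x5 example; the names `upper`, `item_a`, `item_b` and `distance` are made up for illustration, and the explicit dropna() is there because newer pandas versions may keep NaN rows after stack(). Note that it would also drop genuinely missing distances.

```python
import numpy as np
import pandas as pd

labels = [1, 2, 3, 4, 5]
df = pd.DataFrame(
    [[0.000, 1.154, 1.235, 1.297, 0.960],
     [1.154, 0.000, 0.932, 0.929, 0.988],
     [1.235, 0.932, 0.000, 0.727, 1.244],
     [1.297, 0.929, 0.727, 0.000, 1.019],
     [0.960, 0.988, 1.244, 1.019, 0.000]],
    index=labels, columns=labels)

# Boolean mask that is True strictly above the diagonal (k=1).
upper = np.triu(np.ones(df.shape, dtype=bool), k=1)

# where() replaces everything outside the mask with NaN; after stacking,
# dropna() removes those rows, leaving each unordered pair exactly once.
pairs = (df.where(upper)
           .stack()
           .dropna()
           .rename_axis(["item_a", "item_b"])  # hypothetical column names
           .reset_index(name="distance"))

print(pairs)
```

This builds a full-size mask and a NaN-filled copy of the matrix, so it is only reasonable for matrices that fit comfortably in memory.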
Upvotes: 1
Views: 320
Reputation: 262284
Use numpy's triu_indices_from:
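import numpy as np
import pandas as pd

# k=1 selects positions strictly above the diagonal (no zeros, no duplicates)
a = df.to_numpy()
idx, col = np.triu_indices_from(a, k=1)
out = pd.DataFrame({'index': df.index[idx],
                    'columns': df.columns[col],
                    'value': a[idx, col]})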
Output:
   index  columns  value
0      1        2  1.154
1      1        3  1.235
2      1        4  1.297
3      1        5  0.960
4      2        3  0.932
5      2        4  0.929
6      2        5  0.988
7      3        4  0.727
8      3        5  1.244
9      4        5  1.019
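On the scaling question: assuming float64 values and 64-bit indices, a dense 100,000 x 100,000 matrix is 80 GB, and its strict upper triangle holds 4,999,950,000 pairs, so each of the two index arrays returned by triu_indices_from would itself need about 40 GB. One possible way around that is to walk the matrix in row blocks and emit the pairs chunk by chunk. The sketch below is not a tuned implementation; `iter_upper_pairs` and `block_rows` are hypothetical names, and it assumes the matrix can be sliced by rows without loading it whole (for example a np.memmap).

```python
import numpy as np

def iter_upper_pairs(a, block_rows=1024):
    """Yield (row_idx, col_idx, values) for the strict upper triangle,
    one block of rows at a time, in the same order as triu_indices(k=1)."""
    n = a.shape[0]
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        block = np.asarray(a[start:stop])  # only this slab is materialized
        # Local row r is global row start + r, so keeping columns with
        # col - r >= start + 1 keeps exactly the columns right of the diagonal.
        rows, cols = np.triu_indices(stop - start, k=start + 1, m=n)
        yield rows + start, cols, block[rows, cols]

# Small check on the 5x5 example, using blocks of 2 rows.
a = np.array([[0.000, 1.154, 1.235, 1.297, 0.960],
              [1.154, 0.000, 0.932, 0.929, 0.988],
              [1.235, 0.932, 0.000, 0.727, 1.244],
              [1.297, 0.929, 0.727, 0.000, 1.019],
              [0.960, 0.988, 1.244, 1.019, 0.000]])
chunks = list(iter_upper_pairs(a, block_rows=2))
rows = np.concatenate([c[0] for c in chunks])
cols = np.concatenate([c[1] for c in chunks])
vals = np.concatenate([c[2] for c in chunks])
print(len(vals))
```

In a real run each chunk would be written out (or aggregated) instead of being collected in a list, since the full pair list is as large as the problem itself.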
Upvotes: 3
Reputation: 120539
Use numpy:
>>> df.values[np.triu_indices(df.shape[0], k=1)]
array([1.154, 1.235, 1.297, 0.96 , 0.932, 0.929, 0.988, 0.727, 1.244,
1.019])
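This returns only the values. If the item labels are needed as well, one option for small matrices is to pair them up with itertools.combinations, which walks the label pairs in the same row-major order as triu_indices with k=1. A minimal sketch, assuming the index and the columns carry the same labels; `item_a`, `item_b` and `distance` are hypothetical names:

```python
from itertools import combinations

import numpy as np
import pandas as pd

labels = [1, 2, 3, 4, 5]
df = pd.DataFrame(
    [[0.000, 1.154, 1.235, 1.297, 0.960],
     [1.154, 0.000, 0.932, 0.929, 0.988],
     [1.235, 0.932, 0.000, 0.727, 1.244],
     [1.297, 0.929, 0.727, 0.000, 1.019],
     [0.960, 0.988, 1.244, 1.019, 0.000]],
    index=labels, columns=labels)

# Values strictly above the diagonal, in row-major order.
values = df.to_numpy()[np.triu_indices(df.shape[0], k=1)]

# combinations() yields (1, 2), (1, 3), ..., (4, 5): the same order as above.
pair_index = pd.MultiIndex.from_tuples(
    list(combinations(df.index, 2)), names=["item_a", "item_b"])

out = pd.Series(values, index=pair_index, name="distance")
print(out)
```

The Python-level tuples make this impractical for a matrix with billions of pairs; there, indexing the labels with the arrays from triu_indices is the cheaper route.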
Upvotes: 0