I am trying to compute the distances between vectors in two pandas dataframes using cdist from scipy.spatial.distance, but the output is all wrong and I can't pinpoint where it fails.
My original dataframes look like this:
df_sample =
Fingerprint
1272 [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
657 [1.44, 12.0, 10.0, 5.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.23, 4.36, 15.0]
806 [4.58, 13.09, 15.46, 3.59, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 6.31]
and
DF =
barcode \
4538 A4060462000516278
5043 A4050494272716275
11663 A4070271111316245
2701 A4060462848716270
825 A4060454573516274
8679 A4060462010016274
11700 A4060462080916270
8594 A4060461067716272
8707 A4060454363916275
1071 A4060463723916275
Geopos Ack
4538 [0.0, 0.0, 0.0, 0.0, 6.0, 15.0, 16.0, 0.0, 0.0, 5.0, 0.0, 15.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5, 0.0, 3.0]
5043 [0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 12.0, 0.0, 13.0, 15.0, 0.0, 15.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 0.0]
11663 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 15.0, 0.0, 0.0, 0.0, 6.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2701 [0.0, 0.0, 0.0, 8.0, 13.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 7.0]
825 [0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 15.0, 0.0, 13.0, 16.0, 0.0, 9.0, 3.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
8679 [0.0, 4.0, 9.0, 15.0, 10.0, 3.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 9.0]
11700 [0.0, 0.0, 6.0, 0.0, 15.0, 8.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 6.0]
8594 [12.0, 16.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 5.0]
8707 [7.0, 5.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0, 15.0]
1071 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
(I provide dictionaries for both at the end of the question).
As you can see, they have different numbers of rows (although the vectors belong to the same 26-dimensional space). To remedy this, I append zero vectors to df_sample like this:
Number_AP = 26
number_zero_vectors = len(DF) - len(df_sample)
df = pd.DataFrame(columns=['Fingerprint'])
for k in range(number_zero_vectors):
    a = zerolistmaker(Number_AP)
    df = df.append({'Fingerprint': a}, ignore_index=True)
df_sample_ = pd.concat([df_sample, df])
Hence, DF and df_sample_ have the same number of rows. However, the dtype of both df_sample_['Fingerprint'] and DF['Geopos Ack'] is object, that is, each cell holds a list. So I need to make them into arrays. The results are arrays of arrays:
Ax = df_sample_['Fingerprint'] = df_sample_['Fingerprint'].apply(lambda x: np.array(x))
Bx = DF['Geopos Ack'] = DF['Geopos Ack'].apply(lambda x: np.array(x))
I therefore need to 1) make them into arrays (of vectors) and 2) make sure they have the same shape in order to use cdist:
A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.concatenate(A, axis=0).reshape(-1,1)
BB = np.concatenate(B, axis=0).reshape(-1,1)
In short, I wish to compute the distance between every pair of vectors (a, b), where a is a vector in A and b is a vector in B.
For instance:
A = [[1, 0], [0, 1]];
B = [[1, 1], [1, 2], [2, 1]];
D = [[1, 2, 2^0.5], [1, 2^0.5, 2]]
So, to compute the distances I use the following full code:
import numpy as np
import scipy.spatial.distance as sp
Ax = df_sample_['Fingerprint'] = df_sample_['Fingerprint'].apply(lambda x: np.array(x))
Bx = DF['Geopos Ack'] = DF['Geopos Ack'].apply(lambda x: np.array(x))
A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.concatenate(A, axis=0).reshape(-1,1)
BB = np.concatenate(B, axis=0).reshape(-1,1)
d = sp.cdist(AA,BB, 'euclidean')
But this returns
array([[0., 0., 0., ..., 0., 0., 0.],
[4., 4., 4., ..., 4., 4., 4.],
[8., 8., 8., ..., 8., 8., 8.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
which is the concatenation of all the arrays in df_sample_.
Where did I go wrong? I know another approach would be to use pairwise_distances from sklearn, but I did not manage to apply it to my dataframes.
Any help would be appreciated.
Data:
df_sample =
{'Fingerprint': {1272: [0.0,
4.0,
8.0,
15.0,
10.0,
8.0,
2.54,
2.0,
4.91,
0.0,
0.0,
0.0,
0.0,
3.59,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.0],
657: [1.44,
12.0,
10.0,
5.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.23,
4.36,
15.0],
806: [4.58,
13.09,
15.46,
3.59,
3.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
0.0,
6.31]}}
and
DF =
{'barcode': {4538: 'A4060462000516278',
5043: 'A4050494272716275',
11663: 'A4070271111316245',
2701: 'A4060462848716270',
825: 'A4060454573516274',
8679: 'A4060462010016274',
11700: 'A4060462080916270',
8594: 'A4060461067716272',
8707: 'A4060454363916275',
1071: 'A4060463723916275'},
'Geopos Ack': {4538: [0.0,
0.0,
0.0,
0.0,
6.0,
15.0,
16.0,
0.0,
0.0,
5.0,
0.0,
15.0,
0.0,
0.0,
0.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.5,
0.0,
3.0],
5043: [0.0,
0.0,
0.0,
0.0,
0.0,
16.0,
12.0,
0.0,
13.0,
15.0,
0.0,
15.0,
0.0,
0.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
3.0,
3.0,
0.0],
11663: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
5.0,
15.0,
0.0,
0.0,
0.0,
6.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
2701: [0.0,
0.0,
0.0,
8.0,
13.0,
16.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
6.0,
0.0,
7.0],
825: [0.0,
0.0,
0.0,
0.0,
0.0,
11.0,
15.0,
0.0,
13.0,
16.0,
0.0,
9.0,
3.0,
0.0,
6.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0],
8679: [0.0,
4.0,
9.0,
15.0,
10.0,
3.0,
2.0,
0.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
9.0],
11700: [0.0,
0.0,
6.0,
0.0,
15.0,
8.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
16.0,
0.0,
6.0],
8594: [12.0,
16.0,
16.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
8.0,
0.0,
5.0],
8707: [7.0,
5.0,
2.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
2.0,
8.0,
15.0],
1071: [0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
12.0,
15.5,
6.0,
3.5,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0,
0.0]}}
Upvotes: 1
Views: 510
Reputation: 2002
As mentioned in the scipy.spatial.distance docs, XA and XB are supposed to be collections of the vectors between which you want to compute distances, that is, 2-D arrays with one vector per row. What your code does is concatenate all the vectors into one long vector and reshape it into a single column, so cdist ends up comparing individual numbers rather than 26-dimensional vectors. What I think you wanted was to stack them. Your exact intentions were not clear from the question, though, so I might be wrong.
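To make the shape difference concrete, here is a minimal sketch on made-up 2-D vectors. The names a_rows, b_rows, flat_* and stacked_* are illustrative and not from the question; the lists of arrays stand in for the per-row arrays held in the dataframe columns.

```python
import numpy as np
import scipy.spatial.distance as sp

# Illustrative stand-ins for the per-row vectors (not the question's data)
a_rows = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
b_rows = [np.array([1.0, 1.0]), np.array([1.0, 2.0]), np.array([2.0, 1.0])]

# The question's approach: one long column, so every scalar becomes a 1-D "vector"
flat_a = np.concatenate(a_rows, axis=0).reshape(-1, 1)  # shape (4, 1)
flat_b = np.concatenate(b_rows, axis=0).reshape(-1, 1)  # shape (6, 1)

# Stacking keeps one vector per row, which is the layout cdist expects
stacked_a = np.stack(a_rows)  # shape (2, 2)
stacked_b = np.stack(b_rows)  # shape (3, 2)

d_flat = sp.cdist(flat_a, flat_b, 'euclidean')            # shape (4, 6), entries |a_i - b_j|
d_stacked = sp.cdist(stacked_a, stacked_b, 'euclidean')   # shape (2, 3), one entry per vector pair

print(d_flat.shape, d_stacked.shape)
```

With the flattened inputs, each row of the result depends on a single component of a single vector, which would explain the constant-looking rows in the question's output.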
import pandas as pd
import numpy as np
import scipy.spatial.distance as sp
# df_sample and DF are OP's dictionaries
df_sample_df = pd.DataFrame(df_sample)
DF_df = pd.DataFrame(DF)
Ax = df_sample_df['Fingerprint'] = df_sample_df['Fingerprint'].apply(lambda x: np.array(x))
Bx = DF_df['Geopos Ack'] = DF_df['Geopos Ack'].apply(lambda x: np.array(x))
A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.stack(A)
BB = np.stack(B)
d = sp.cdist(AA,BB, 'euclidean')
print(f'd.shape = {d.shape}')
print(f'd[0, 0] = {d[0, 0]}')
print(f'L2(AA[0],BB[0]) = {np.sum((AA[0] - BB[0])**2)**0.5}')
output:
d.shape = (3, 10)
d[0, 0] = 34.57536840006191
L2(AA[0],BB[0]) = 34.57536840006192
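Note that the result above has shape (3, 10): cdist only needs both inputs to have the same number of columns, so padding df_sample with zero vectors does not appear to be necessary for this interpretation. As a possible shortcut, a column of equal-length lists can be turned into a 2-D array in one step, and the result can be labelled with the original indices. This is only a sketch on tiny made-up frames; the names sample, reference, XA, XB and dist are assumptions, and only the column names follow the question.

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as sp

# Tiny stand-in frames; column names follow the question, values are made up
sample = pd.DataFrame({'Fingerprint': [[0.0, 3.0], [4.0, 0.0]]},
                      index=[1272, 657])
reference = pd.DataFrame({'Geopos Ack': [[0.0, 0.0], [4.0, 3.0], [0.0, 3.0]]},
                         index=[4538, 5043, 11663])

# A column of equal-length lists converts directly into a 2-D float array
XA = np.array(sample['Fingerprint'].tolist())     # shape (2, 2)
XB = np.array(reference['Geopos Ack'].tolist())   # shape (3, 2)

# Rows are labelled by the sample index, columns by the reference index
dist = pd.DataFrame(sp.cdist(XA, XB, 'euclidean'),
                    index=sample.index, columns=reference.index)
print(dist)
```

Keeping the index labels makes it easier to look up the distance for a specific pair, for example dist.loc[1272, 5043].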
To make your question clearer, you could explain which distances you want to calculate and add a MINIMAL reproducible example, such as:
"I want to find the distance between every pair of vectors (a, b) where a is a vector in A and b is a vector in B.
A = [[1, 0], [0, 1]];
B = [[1, 1], [1, 2], [2, 1]];
D = [[1, 2, 2^0.5], [1, 2^0.5, 2]]
"
Or:
"I want to find the Frobenius norm of the difference between the padded matrix A and the matrix B.
A = [[1, 0], [0, 1]];
B = [[1, 1], [1, 2], [2, 1]];
D = 8^0.5
"
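Both interpretations can be checked numerically on the toy matrices. The following is a sketch using cdist for the pairwise case and np.linalg.norm for the Frobenius case; the names D_pairwise, A_padded and D_frobenius are illustrative.

```python
import numpy as np
import scipy.spatial.distance as sp

A = np.array([[1, 0], [0, 1]], dtype=float)
B = np.array([[1, 1], [1, 2], [2, 1]], dtype=float)

# Interpretation 1: Euclidean distance between every row of A and every row of B
D_pairwise = sp.cdist(A, B, 'euclidean')  # shape (2, 3)

# Interpretation 2: zero-pad A to B's row count, then take the
# Frobenius norm of the element-wise difference (a single number)
A_padded = np.vstack([A, np.zeros((len(B) - len(A), A.shape[1]))])
D_frobenius = np.linalg.norm(A_padded - B, 'fro')

print(D_pairwise)
print(D_frobenius)
```

The first interpretation yields a matrix with one entry per pair of vectors, while the second collapses everything into one scalar, so stating which one is wanted removes the ambiguity.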
Upvotes: 1