Problems computing cdist of two columns in two different dataframes

Question

I am trying to compute the distance between vectors in two pandas dataframes using cdist from scipy.spatial.distance, but the output is all wrong and I can't pinpoint where it fails.

My original dataframes look like this:

df_sample = 
                                             Fingerprint
1272    [0.0, 4.0, 8.0, 15.0, 10.0, 8.0, 2.54, 2.0, 4.91, 0.0, 0.0, 0.0, 0.0, 3.59, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0]
657    [1.44, 12.0, 10.0, 5.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.23, 4.36, 15.0]
806   [4.58, 13.09, 15.46, 3.59, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 6.31]

and

DF = 
  barcode  \
4538   A4060462000516278   
5043   A4050494272716275   
11663  A4070271111316245   
2701   A4060462848716270   
825    A4060454573516274   
8679   A4060462010016274   
11700  A4060462080916270   
8594   A4060461067716272   
8707   A4060454363916275   
1071   A4060463723916275   

                                                                                                                                    Geopos Ack  
4538     [0.0, 0.0, 0.0, 0.0, 6.0, 15.0, 16.0, 0.0, 0.0, 5.0, 0.0, 15.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.5, 0.0, 3.0]  
5043   [0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 12.0, 0.0, 13.0, 15.0, 0.0, 15.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 0.0]  
11663      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 15.0, 0.0, 0.0, 0.0, 6.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
2701      [0.0, 0.0, 0.0, 8.0, 13.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 0.0, 7.0]  
825     [0.0, 0.0, 0.0, 0.0, 0.0, 11.0, 15.0, 0.0, 13.0, 16.0, 0.0, 9.0, 3.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
8679      [0.0, 4.0, 9.0, 15.0, 10.0, 3.0, 2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 9.0]  
11700     [0.0, 0.0, 6.0, 0.0, 15.0, 8.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 6.0]  
8594     [12.0, 16.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 5.0]  
8707       [7.0, 5.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 8.0, 15.0]  
1071      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, 15.5, 6.0, 3.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

(I provide dictionaries for both at the end of the question).

As you can see, they have different numbers of rows (although the vectors belong to the same space). To remedy this, I pad df_sample with zero vectors:

Number_AP = 26
number_zero_vectors = len(DF)-len(df_sample)
df =pd.DataFrame(columns = ['Fingerprint'])
for k in range(number_zero_vectors):
    a = zerolistmaker(Number_AP)
    df = df.append({'Fingerprint':a},ignore_index=True)

df_sample_ = pd.concat([df_sample, df])
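
Note that `DataFrame.append` was removed in pandas 2.0 and that `zerolistmaker` is not defined in the snippet above. A minimal sketch of the same padding step, assuming `zerolistmaker(n)` simply returns a list of `n` zeros and using small hypothetical stand-ins for `df_sample` and `DF`, could look like this:

```python
import pandas as pd


def zerolistmaker(n):
    # Assumed helper (not shown in the question): a list of n zeros.
    return [0.0] * n


# Hypothetical stand-ins for df_sample and DF, with 3-dimensional vectors.
df_sample = pd.DataFrame({'Fingerprint': [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]})
DF = pd.DataFrame({'Geopos Ack': [[0.0, 1.0, 0.0]] * 5})

Number_AP = 3
number_zero_vectors = len(DF) - len(df_sample)

# Build all padding rows at once instead of appending inside a loop.
padding = pd.DataFrame(
    {'Fingerprint': [zerolistmaker(Number_AP) for _ in range(number_zero_vectors)]}
)
df_sample_ = pd.concat([df_sample, padding])

print(df_sample_.shape)
```

As far as `cdist` is concerned, this padding is optional: its two inputs only need the same number of columns, not the same number of rows.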

Hence, DF and df_sample_ have the same number of rows. However, the dtype of both df_sample_['Fingerprint'] and DF['Geopos Ack'] is object, because each cell holds a list. So I convert each list into an array, which gives a Series of arrays:

Ax = df_sample_['Fingerprint'] = df_sample_['Fingerprint'].apply(lambda x: np.array(x))
Bx = DF['Geopos Ack'] = DF['Geopos Ack'].apply(lambda x: np.array(x))

I therefore need to 1) turn them into NumPy arrays (of vectors) and 2) make sure they have the same shape to use cdist:

A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.concatenate(A, axis=0).reshape(-1,1)
BB = np.concatenate(B, axis=0).reshape(-1,1)

In short, I wish to compute the distance between every pair of vectors (a, b) where a is a vector in A and b is a vector in B.

For instance:

A = [[1, 0], [0, 1]];
B = [[1, 1], [1, 2], [2, 1]];
D = [[1, 2, 2^0.5], [1, 2^0.5, 2]]

So, to compute the distances I use the following full code:

import numpy as np
import pandas as pd
import scipy.spatial.distance as sp

Ax = df_sample_['Fingerprint'] = df_sample_['Fingerprint'].apply(lambda x: np.array(x))
Bx = DF['Geopos Ack'] = DF['Geopos Ack'].apply(lambda x: np.array(x))

A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.concatenate(A, axis=0).reshape(-1,1)
BB = np.concatenate(B, axis=0).reshape(-1,1)


d = sp.cdist(AA,BB, 'euclidean')

But this returns

array([[0., 0., 0., ..., 0., 0., 0.],
       [4., 4., 4., ..., 4., 4., 4.],
       [8., 8., 8., ..., 8., 8., 8.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

which looks like the concatenation of all the arrays in df_sample_ rather than a matrix of vector-to-vector distances.

Where did I go wrong? I know another approach would be to use pairwise_distances from sklearn, but I did not manage to apply it to my dataframes.

Any help would be appreciated.
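
For reference, the scikit-learn function is `sklearn.metrics.pairwise_distances`. A minimal sketch of how it could be called, assuming the vectors have already been arranged as 2-D arrays with one vector per row (the small arrays below are hypothetical stand-ins, not the data from this question):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical stand-ins for the 'Fingerprint' and 'Geopos Ack' vectors,
# one vector per row.
AA = np.array([[1.0, 0.0], [0.0, 1.0]])
BB = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])

# Entry (i, j) is the Euclidean distance between AA[i] and BB[j].
d = pairwise_distances(AA, BB, metric='euclidean')
print(d.shape)
```

Like `cdist`, this expects the two inputs to share the number of columns only, so the result here has one row per vector in `AA` and one column per vector in `BB`.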

Data:

df_sample = 
{'Fingerprint': {1272: [0.0,
   4.0,
   8.0,
   15.0,
   10.0,
   8.0,
   2.54,
   2.0,
   4.91,
   0.0,
   0.0,
   0.0,
   0.0,
   3.59,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   8.0],
  657: [1.44,
   12.0,
   10.0,
   5.0,
   6.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   8.23,
   4.36,
   15.0],
  806: [4.58,
   13.09,
   15.46,
   3.59,
   3.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   0.0,
   6.31]}}

and

DF = 
{'barcode': {4538: 'A4060462000516278',
  5043: 'A4050494272716275',
  11663: 'A4070271111316245',
  2701: 'A4060462848716270',
  825: 'A4060454573516274',
  8679: 'A4060462010016274',
  11700: 'A4060462080916270',
  8594: 'A4060461067716272',
  8707: 'A4060454363916275',
  1071: 'A4060463723916275'},
 'Geopos Ack': {4538: [0.0,
   0.0,
   0.0,
   0.0,
   6.0,
   15.0,
   16.0,
   0.0,
   0.0,
   5.0,
   0.0,
   15.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   3.5,
   0.0,
   3.0],
  5043: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   16.0,
   12.0,
   0.0,
   13.0,
   15.0,
   0.0,
   15.0,
   0.0,
   0.0,
   6.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   3.0,
   3.0,
   0.0],
  11663: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   5.0,
   15.0,
   0.0,
   0.0,
   0.0,
   6.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0],
  2701: [0.0,
   0.0,
   0.0,
   8.0,
   13.0,
   16.0,
   6.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   6.0,
   0.0,
   7.0],
  825: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   11.0,
   15.0,
   0.0,
   13.0,
   16.0,
   0.0,
   9.0,
   3.0,
   0.0,
   6.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0],
  8679: [0.0,
   4.0,
   9.0,
   15.0,
   10.0,
   3.0,
   2.0,
   0.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   9.0],
  11700: [0.0,
   0.0,
   6.0,
   0.0,
   15.0,
   8.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   16.0,
   0.0,
   6.0],
  8594: [12.0,
   16.0,
   16.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   8.0,
   0.0,
   5.0],
  8707: [7.0,
   5.0,
   2.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   2.0,
   8.0,
   15.0],
  1071: [0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   12.0,
   15.5,
   6.0,
   3.5,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0]}}

yann ziselman · Accepted Answer

As mentioned in the docs for scipy.spatial.distance.cdist, XA and XB are supposed to be collections of vectors (2-D arrays with one vector per row) between which you want to find the distances. What your code does is make one long vector out of all the vectors and compare those, when what I think you wanted was to stack them. Your exact intentions were not clear from the question, though, so I might be wrong.

import pandas as pd
import numpy as np
import scipy.spatial.distance as sp

# df_sample and DF are OP's dictionaries
df_sample_df = pd.DataFrame(df_sample)
DF_df = pd.DataFrame(DF)

Ax = df_sample_df['Fingerprint'] = df_sample_df['Fingerprint'].apply(lambda x: np.array(x))
Bx = DF_df['Geopos Ack'] = DF_df['Geopos Ack'].apply(lambda x: np.array(x))

A = Ax.to_numpy()
B = Bx.to_numpy()
AA = np.stack(A)
BB = np.stack(B)


d = sp.cdist(AA,BB, 'euclidean')
print(f'd.shape = {d.shape}')
print(f'd[0, 0] = {d[0, 0]}')
print(f'L2(AA[0],BB[0]) = {np.sum((AA[0] - BB[0])**2)**0.5}')

output:

d.shape = (3, 10)
d[0, 0] = 34.57536840006191
L2(AA[0],BB[0]) = 34.57536840006192
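
The difference between the two reshaping approaches can be seen on a small example. This is only an illustrative sketch with made-up 2-dimensional vectors; the object arrays are built by hand to mimic what `Series.to_numpy()` returns for a column of arrays:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Object arrays holding 1-D vectors (hypothetical data).
A = np.empty(2, dtype=object)
A[0], A[1] = np.array([1.0, 0.0]), np.array([0.0, 1.0])
B = np.empty(3, dtype=object)
B[0], B[1], B[2] = np.array([1.0, 1.0]), np.array([1.0, 2.0]), np.array([2.0, 1.0])

# Question's approach: every scalar becomes its own 1-dimensional "vector".
flat_A = np.concatenate(A, axis=0).reshape(-1, 1)
flat_B = np.concatenate(B, axis=0).reshape(-1, 1)

# Stacking keeps one row per original vector.
stacked_A = np.stack(A)
stacked_B = np.stack(B)

print(flat_A.shape, flat_B.shape)          # (4, 1) (6, 1)
print(stacked_A.shape, stacked_B.shape)    # (2, 2) (3, 2)
print(cdist(flat_A, flat_B).shape)         # (4, 6)
print(cdist(stacked_A, stacked_B).shape)   # (2, 3)
```

With the flattened inputs, `cdist` compares individual numbers, so each entry is just the absolute difference between two scalars. With the stacked inputs, each entry is the distance between two whole vectors.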

To make your question clearer, you can explain which distances you want to calculate and add a MINIMAL reproducible example, such as:

"I want to find the distance between every pair of vectors (a, b) where a is a vector in A and b is a vector in B.
A = [[1, 0], [0, 1]];
B = [[1, 1], [1, 2], [2, 1]];
D = [[1, 2, 2^0.5], [1, 2^0.5, 2]] "

Or:

"I want to find the Frobenius norm of the difference between the padded matrix A and the matrix B.
A = [[1, 0], [0, 1]];
B = [[1, 1], [1, 2], [2, 1]];
D = 8^0.5 "
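
Both readings can be checked numerically on these toy matrices. The following is a sketch only; padding A with zero rows up to the shape of B is my interpretation of "padded":

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1, 0], [0, 1]], dtype=float)
B = np.array([[1, 1], [1, 2], [2, 1]], dtype=float)

# First reading: all pairwise Euclidean distances, shape (len(A), len(B)).
D_pairwise = cdist(A, B, 'euclidean')

# Second reading: pad A with zero rows to B's shape, then take the
# Frobenius norm of the difference (the default norm for 2-D input).
A_padded = np.zeros_like(B)
A_padded[:len(A)] = A
D_frobenius = np.linalg.norm(A_padded - B)

print(D_pairwise)
print(D_frobenius)
```

The first reading yields a 2 x 3 matrix and needs no padding, while the second collapses everything into a single number.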
