Reputation: 479
I have a dataframe like the one below, and I need to calculate the Euclidean distance.
a,b,c,d,e
10,11,13,14,9
11,12,14,15,10
12,13,15,16,11
13,14,16,17,12
14,15,17,18,13
15,16,18,19,14
16,17,19,20,15
17,18,20,21,16
18,19,21,22,17
19,20,22,23,18
20,21,23,24,19
21,22,24,25,20
22,23,25,26,21
23,24,26,27,22
24,25,27,28,23
With only two feature columns, say a and b, I can easily do:
import numpy as np

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b)**2))
How can I calculate the Euclidean distance for a dataframe with several feature columns, like a, b, c, d, e above?
Upvotes: 1
Views: 2366
Reputation: 394
Your data has shape (15, 5). If I am not mistaken, you want to treat each of the 5 columns as a point in 15 dimensions and compute the Euclidean distance between every pair of those points.
import numpy as np
import pandas as pd
# copied and pasted your data to a text file
df = pd.read_table("euclidean.txt", sep=',')
> df.shape
(15, 5)
The shape is (15, 5), so the distance matrix will be 5x5. Initialize this matrix, calculate the Euclidean distance between each pair of these 5 points using for loops, and fill the results into the distance matrix.
n = df.shape[1]  # number of columns (points); 5 for the dataset you provided
dm = np.zeros((n, n))  # initialize the distance matrix to zero
for i in range(n):
    for j in range(n):
        dm[i, j] = np.sqrt(np.sum((df.iloc[:, i] - df.iloc[:, j])**2))
dm
The output is then:
> dm
array([[ 0. , 3.87298335, 11.61895004, 15.49193338, 3.87298335],
[ 3.87298335, 0. , 7.74596669, 11.61895004, 7.74596669],
[11.61895004, 7.74596669, 0. , 3.87298335, 15.49193338],
[15.49193338, 11.61895004, 3.87298335, 0. , 19.36491673],
[ 3.87298335, 7.74596669, 15.49193338, 19.36491673, 0. ]])
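The nested loops above can also be written without explicit Python loops. The following is a minimal sketch, not part of the original answer: it rebuilds the question's data inline (so no euclidean.txt file is needed) and assumes, as above, that each column is one point. The names first_row, dm_vec and dm_labeled are made up for this illustration; pdist and squareform come from scipy.spatial.distance.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Rebuild the question's data: 15 rows, columns a..e,
# where each row is the previous row plus 1 in every column.
first_row = np.array([10, 11, 13, 14, 9])
df = pd.DataFrame(first_row + np.arange(15)[:, None], columns=list("abcde"))

# pdist treats each ROW as one observation, so transpose to make
# every column of df a point in 15-dimensional space.
dm_vec = squareform(pdist(df.T.values, metric="euclidean"))

# Label rows and columns with the original column names for readability.
dm_labeled = pd.DataFrame(dm_vec, index=df.columns, columns=df.columns)
print(dm_labeled.round(4))
```

Because columns a and b differ by exactly 1 in each of the 15 rows, the a/b entry equals sqrt(15), which matches the 3.87298335 in the loop output.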
Upvotes: 1
Reputation: 150725
How about cdist:
from scipy.spatial.distance import cdist
arr = df[['a','b','c','d','e']].values
dist_mat = cdist(arr,arr)
If you would rather avoid an extra package, the same distance matrix can be computed with NumPy broadcasting:
dist_mat = ((arr[None,:,:] - arr[:,None,:])**2).sum(-1)**.5
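To make the broadcasting one-liner easier to follow, here is a small sketch that spells out the intermediate shapes and checks the result against cdist. The three-point input array is a made-up example, not the question's data, and the names diff, dist_broadcast and dist_cdist are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical input: 3 points (rows) in 2 dimensions (columns).
arr = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# arr[None, :, :] has shape (1, 3, 2) and arr[:, None, :] has shape (3, 1, 2).
# Broadcasting the subtraction gives every pairwise difference: shape (3, 3, 2).
diff = arr[None, :, :] - arr[:, None, :]

# Sum the squared differences over the last (feature) axis, then take the root.
dist_broadcast = np.sqrt((diff ** 2).sum(axis=-1))

# Same computation via SciPy, for comparison.
dist_cdist = cdist(arr, arr)
print(np.allclose(dist_broadcast, dist_cdist))
```

Note that the intermediate diff array holds n * n * d values for n points in d dimensions, so for large inputs the cdist version is likely the more memory-friendly choice.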
Upvotes: 0
Reputation: 1869
If I understand the question correctly, you want to create a distance matrix for all of your rows?
import pandas as pd
from scipy.spatial.distance import pdist, squareform
df = pd.DataFrame([{'a':1,'b':2,'c':3}, {'a':4,'b':5,'c':6}])
distances = squareform(pdist(df.values, metric='euclidean'))
resulting in a matrix containing
array([[0. , 5.19615242],
[5.19615242, 0. ]])
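The snippet above uses a two-row toy DataFrame. As a rough sketch of how the same row-wise approach might look on the data from the question, the block below rebuilds that data inline and treats each of the 15 rows as a point in 5 dimensions, which gives a 15x15 matrix. The names first_row and row_dist are chosen for this illustration only.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Rebuild the question's data: 15 rows, columns a..e,
# where each row is the previous row plus 1 in every column.
first_row = np.array([10, 11, 13, 14, 9])
df = pd.DataFrame(first_row + np.arange(15)[:, None], columns=list("abcde"))

# Rows as points: pdist compares every pair of rows, and squareform
# expands the condensed result into a full symmetric matrix.
row_dist = pd.DataFrame(
    squareform(pdist(df.values, metric="euclidean")),
    index=df.index,
    columns=df.index,
)

print(row_dist.shape)
print(row_dist.loc[0, 1])
```

Since consecutive rows differ by 1 in all five columns, the distance between row 0 and row 1 is sqrt(5). Whether rows or columns should be the points depends on what the data represents; passing df.T.values instead would compare columns.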
Upvotes: 1