GKC

Reputation: 479

Calculating euclidean distance from a dataframe with several column features

I have a dataframe like the one below and I need to calculate the Euclidean distance.

a,b,c,d,e
10,11,13,14,9
11,12,14,15,10
12,13,15,16,11
13,14,16,17,12
14,15,17,18,13
15,16,18,19,14
16,17,19,20,15
17,18,20,21,16
18,19,21,22,17
19,20,22,23,18
20,21,23,24,19
21,22,24,25,20
22,23,25,26,21
23,24,26,27,22
24,25,27,28,23

With only two column features, say a and b, I can easily do:

import numpy as np

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b)**2))

How can I calculate the Euclidean distance for a dataframe with several column features, like a, b, c, d, e above?

Upvotes: 1

Views: 2366

Answers (3)

luminare

Reputation: 394

Your data has shape (15, 5). Treating each column as a point with 15 coordinates, you want the Euclidean distance between every pair of those 5 points, if I am not mistaken.

import numpy as np
import pandas as pd

# copied and pasted your data to a text file
df = pd.read_table("euclidean.txt", sep=',') 

> df.shape 
(15, 5)

For data of shape (15, 5), the distance matrix will be 5x5. Initialize this matrix, calculate the Euclidean distance between each pair of the 5 points using for loops, and fill the results into the distance matrix.

n = df.shape[1] # this number is 5 for the dataset you provided
dm = np.zeros((n,n)) # initialize the distance matrix to zero

for i in range(n):
    for j in range(n):
        dm[i,j] = np.sqrt(np.sum((df.iloc[:,i] - df.iloc[:,j])**2))

The resulting dm is then:

> dm
array([[ 0.        ,  3.87298335, 11.61895004, 15.49193338,  3.87298335],
       [ 3.87298335,  0.        ,  7.74596669, 11.61895004,  7.74596669],
       [11.61895004,  7.74596669,  0.        ,  3.87298335, 15.49193338],
       [15.49193338, 11.61895004,  3.87298335,  0.        , 19.36491673],
       [ 3.87298335,  7.74596669, 15.49193338, 19.36491673,  0.        ]])
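The same column-wise matrix can also be computed without the explicit loops. Below is a minimal sketch using NumPy broadcasting; it rebuilds the question's data inline instead of reading it from a file, and the names `pts`, `dm_vec` and `dm_df` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data: column a runs 10..24, the others are fixed offsets of it
a = np.arange(10, 25)
df = pd.DataFrame({"a": a, "b": a + 1, "c": a + 3, "d": a + 4, "e": a - 1})

# Treat each column as one point: transpose to shape (5, 15)
pts = df.to_numpy(dtype=float).T

# Pairwise differences via broadcasting: (5, 1, 15) - (1, 5, 15) -> (5, 5, 15)
diff = pts[:, None, :] - pts[None, :, :]
dm_vec = np.sqrt((diff ** 2).sum(axis=-1))

# Label rows and columns with the original column names
dm_df = pd.DataFrame(dm_vec, index=df.columns, columns=df.columns)
print(dm_df.round(3))
```

This should reproduce the values from the loop version, with the added convenience that the rows and columns of `dm_df` are labelled a through e.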

Upvotes: 1

Quang Hoang

Reputation: 150725

How about cdist, which computes the pairwise distances between rows:

from scipy.spatial.distance import cdist
arr = df[['a','b','c','d','e']].values
dist_mat = cdist(arr,arr)

If you'd rather avoid an external package, the same distance matrix can be computed with NumPy broadcasting:

dist_mat = ((arr[None,:,:] - arr[:,None,:])**2).sum(-1)**.5
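To check that the two expressions agree, here is a small sketch run on the question's data. It assumes all five columns a through e are used as features and rebuilds the dataframe inline; the names `dist_cdist` and `dist_bcast` are just for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Rebuild the question's data: column a runs 10..24, the others are fixed offsets of it
a = np.arange(10, 25)
df = pd.DataFrame({"a": a, "b": a + 1, "c": a + 3, "d": a + 4, "e": a - 1})

# One row per observation, one column per feature: shape (15, 5)
arr = df[["a", "b", "c", "d", "e"]].to_numpy(dtype=float)

# cdist uses the Euclidean metric by default
dist_cdist = cdist(arr, arr)

# Same computation with plain NumPy broadcasting
dist_bcast = np.sqrt(((arr[None, :, :] - arr[:, None, :]) ** 2).sum(axis=-1))

print(dist_cdist.shape)                     # (15, 15)
print(np.allclose(dist_cdist, dist_bcast))  # True
```

Note that the broadcasting version materializes an intermediate array of shape (n, n, features), so for large dataframes cdist is likely the more memory-friendly choice.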

Upvotes: 0

J-H

Reputation: 1869

If I understand the question correctly, you want to create a distance matrix for all of your rows?

import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame([{'a':1,'b':2,'c':3}, {'a':4,'b':5,'c':6}])
distances = squareform(pdist(df.values, metric='euclidean'))

resulting in a matrix containing

array([[0.        , 5.19615242],
       [5.19615242, 0.        ]])
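For reference, pdist returns a condensed vector with one entry per unordered pair of rows, and squareform expands it into the full symmetric matrix. A rough sketch on the question's own data, rebuilt inline, with the result wrapped in a DataFrame so it keeps the row labels (the names `condensed` and `square` are illustrative):

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Rebuild the question's data: column a runs 10..24, the others are fixed offsets of it
a = np.arange(10, 25)
df = pd.DataFrame({"a": a, "b": a + 1, "c": a + 3, "d": a + 4, "e": a - 1})

# Condensed form: one distance per unordered pair of rows, 15 * 14 / 2 = 105 entries
condensed = pdist(df.to_numpy(dtype=float), metric="euclidean")
print(condensed.shape)  # (105,)

# Expand to a square matrix and keep the dataframe's row labels
square = pd.DataFrame(squareform(condensed), index=df.index, columns=df.index)
print(square.iloc[:3, :3].round(3))
```

The condensed form is handy if the distances are passed on to other SciPy routines such as hierarchical clustering, while the square form is easier to index by row label.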

Upvotes: 1
