Jeremy
Jeremy

Reputation: 829

Creating a Distance Matrix?

I am currently reading in data into a dataframe that looks like this.

City         XCord    YCord   
Boston         5        2
Phoenix        7        3
New York       8        1
.....          .        .

I want to to create a Euclidean Distance Matrix from this data showing the distance between all city pairs so I get a resulting matrix like:

             Boston    Phoenix   New York
Boston         0        2.236      3.162
Phoenix        2.236      0        2.236
New York       3.162    2.236        0

There are many more cities and coordinates in my actual data frame so i need to to be able to somehow iterate over all of the city pairs and create a distance matrix like the one I have shown above but I am not sure how to pair all of the cites together and apply the Euclidean Distance formula? Any help would be appreciated.

Upvotes: 40

Views: 74617

Answers (7)

user2314737
user2314737

Reputation: 29307

This is a pure Python and numpy solution for generating a distance matrix.

Redundant computations can skipped (since distance is symmetric, distance(a,b) is the same as distance(b,a) and there's no need to compute the distance twice).

data = [[5, 7], [7, 3], [8, 1]]
cities = ['Boston', 'Phoenix', 'New York']

# Euclidean distance between two points
from math import sqrt
dist = lambda a,b: sqrt((a[0]-b[0])**2+(a[1]-b[1])**2)

import numpy as np
n = len(data)
dist_matrix = np.zeros((n,n))    # initialize distance matrix to a square of zeros
for i in range(n):
    for j in range(i, n):
        dist_matrix[i,j] = dist(data[i], data[j])
        dist_matrix[j,i] = dist_matrix[i,j]       # for the symmetric part, no computation

Now dist_matrix[i,j] is the distance between city[i] and city[j].

Upvotes: 2

Andrew
Andrew

Reputation: 3871

I think you are intrested in distance_matrix.

For example:

Create data:

import pandas as pd
from scipy.spatial import distance_matrix
    
data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)

Output:

          xcord ycord
Boston      5   7
Phoenix     7   3
New York    8   1

Using the distance matrix function:

 pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)

Results:

          Boston    Phoenix     New York
Boston    0.000000  4.472136    6.708204
Phoenix   4.472136  0.000000    2.236068
New York  6.708204  2.236068    0.000000

Upvotes: 56

eroot163pi
eroot163pi

Reputation: 1815

Refer

import pandas as pd
import numpy as np

data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)
x, y = df.xcord.to_numpy(), df.ycord.to_numpy()
x_y = df.values
%%timeit
pd.DataFrame(
    np.hypot(
        np.subtract.outer(x, x),
        np.subtract.outer(y, y)
    ),
    index=df.index, columns=df.index
)
# 32.9 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
pd.DataFrame(distance_matrix(x_y, x_y), index=df.index, columns=df.index)
# 49.8 µs ± 330 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Also compared to normal custom written sqrt methods, hypot is more resistant to overflows and underflows

Underflow

i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0

Overflow

i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf

No Underflow

i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200

No Overflow

i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200

Upvotes: 0

Surya Gaur
Surya Gaur

Reputation: 11

data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)

n_df=(df.values)
n_df

(df.values).shape

matrix=np.zeros(((df.values).shape[0],(df.values).shape[0]))
matrix


for i in range((df.values).shape[0]):
    for j in range((df.values).shape[0]):
        matrix[i,j]=np.sqrt(np.sum((n_df[i]-n_df[j])**2))
        #print('i',i,'j',j)


print(matrix)

Upvotes: 1

francesco lc
francesco lc

Reputation: 159

if you don't want to use scipy you can exploit list comprehension in this way:

dist = lambda p1, p2: sqrt(((p1-p2)**2).sum())
dm = np.asarray([[dist(p1, p2) for p2 in xy_list] for p1 in xy_list])

Upvotes: 15

Maassa
Maassa

Reputation: 11

There's the function in scipy: scipy.spatial.distance.cdist()

Upvotes: 0

pkacprzak
pkacprzak

Reputation: 5629

I will give a method in pure python.

Import a sqrt function from math module:

from math import sqrt

Let assume that you have your coordinates in cords table in the following way:

cords['Boston'] = (5, 2)

Define a function to compute Euclidean distance of two given 2d points:

def dist(a, b):
    d = [a[0] - b[0], a[1] - b[1]]
    return sqrt(d[0] * d[0] + d[1] * d[1])

Initialize the resulting matrix as a dictionary:

D = {}

for city1, cords1 in cords.items():
    D[city1] = {}
    for city2, cords2 in cords.items():
        D[city1][city2] = dist(cords1, cords2)

D is your resulting matrix

The full source is below along with printed result:

from math import sqrt

cords = {}
cords['Boston'] = (5, 2)
cords['Phoenix'] = (7, 3)
cords['New York'] = (8, 1)

def dist(a, b):
    d = [a[0] - b[0], a[1] - b[1]]
    return sqrt(d[0] * d[0] + d[1] * d[1]) 

D = {}

for city1, cords1 in cords.items():
    D[city1] = {}
    for city2, cords2 in cords.items():
        D[city1][city2] = dist(cords1, cords2)   

for city1, v in D.items():
    for city2, d in v.items():
        print city1, city2, d

Results:

Boston Boston 0.0
Boston New York 3.16227766017
Boston Phoenix 2.2360679775
New York Boston 3.16227766017
New York New York 0.0
New York Phoenix 2.2360679775
Phoenix Boston 2.2360679775
Phoenix New York 2.2360679775
Phoenix Phoenix 0.0

Upvotes: 8

Related Questions