Reputation: 829
I am currently reading in data into a dataframe that looks like this.
City XCord YCord
Boston 5 2
Phoenix 7 3
New York 8 1
..... . .
I want to to create a Euclidean Distance Matrix from this data showing the distance between all city pairs so I get a resulting matrix like:
Boston Phoenix New York
Boston 0 2.236 3.162
Phoenix 2.236 0 2.236
New York 3.162 2.236 0
There are many more cities and coordinates in my actual data frame so i need to to be able to somehow iterate over all of the city pairs and create a distance matrix like the one I have shown above but I am not sure how to pair all of the cites together and apply the Euclidean Distance formula? Any help would be appreciated.
Upvotes: 40
Views: 74617
Reputation: 29307
This is a pure Python and numpy solution for generating a distance matrix.
Redundant computations can skipped (since distance is symmetric, distance(a,b) is the same as distance(b,a) and there's no need to compute the distance twice).
data = [[5, 7], [7, 3], [8, 1]]
cities = ['Boston', 'Phoenix', 'New York']
# Euclidean distance between two points
from math import sqrt
dist = lambda a,b: sqrt((a[0]-b[0])**2+(a[1]-b[1])**2)
import numpy as np
n = len(data)
dist_matrix = np.zeros((n,n)) # initialize distance matrix to a square of zeros
for i in range(n):
for j in range(i, n):
dist_matrix[i,j] = dist(data[i], data[j])
dist_matrix[j,i] = dist_matrix[i,j] # for the symmetric part, no computation
Now dist_matrix[i,j]
is the distance between city[i]
and city[j]
.
Upvotes: 2
Reputation: 3871
I think you are intrested in distance_matrix.
For example:
Create data:
import pandas as pd
from scipy.spatial import distance_matrix
data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)
Output:
xcord ycord
Boston 5 7
Phoenix 7 3
New York 8 1
Using the distance matrix function:
pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)
Results:
Boston Phoenix New York
Boston 0.000000 4.472136 6.708204
Phoenix 4.472136 0.000000 2.236068
New York 6.708204 2.236068 0.000000
Upvotes: 56
Reputation: 1815
import pandas as pd
import numpy as np
data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)
x, y = df.xcord.to_numpy(), df.ycord.to_numpy()
x_y = df.values
%%timeit
pd.DataFrame(
np.hypot(
np.subtract.outer(x, x),
np.subtract.outer(y, y)
),
index=df.index, columns=df.index
)
# 32.9 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
pd.DataFrame(distance_matrix(x_y, x_y), index=df.index, columns=df.index)
# 49.8 µs ± 330 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Also compared to normal custom written sqrt methods, hypot is more resistant to overflows and underflows
i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0
i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf
i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200
i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200
Upvotes: 0
Reputation: 11
data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)
n_df=(df.values)
n_df
(df.values).shape
matrix=np.zeros(((df.values).shape[0],(df.values).shape[0]))
matrix
for i in range((df.values).shape[0]):
for j in range((df.values).shape[0]):
matrix[i,j]=np.sqrt(np.sum((n_df[i]-n_df[j])**2))
#print('i',i,'j',j)
print(matrix)
Upvotes: 1
Reputation: 159
if you don't want to use scipy you can exploit list comprehension in this way:
dist = lambda p1, p2: sqrt(((p1-p2)**2).sum())
dm = np.asarray([[dist(p1, p2) for p2 in xy_list] for p1 in xy_list])
Upvotes: 15
Reputation: 5629
I will give a method in pure python.
Import a sqrt function from math module:
from math import sqrt
Let assume that you have your coordinates in cords table in the following way:
cords['Boston'] = (5, 2)
Define a function to compute Euclidean distance of two given 2d points:
def dist(a, b):
d = [a[0] - b[0], a[1] - b[1]]
return sqrt(d[0] * d[0] + d[1] * d[1])
Initialize the resulting matrix as a dictionary:
D = {}
for city1, cords1 in cords.items():
D[city1] = {}
for city2, cords2 in cords.items():
D[city1][city2] = dist(cords1, cords2)
D is your resulting matrix
The full source is below along with printed result:
from math import sqrt
cords = {}
cords['Boston'] = (5, 2)
cords['Phoenix'] = (7, 3)
cords['New York'] = (8, 1)
def dist(a, b):
d = [a[0] - b[0], a[1] - b[1]]
return sqrt(d[0] * d[0] + d[1] * d[1])
D = {}
for city1, cords1 in cords.items():
D[city1] = {}
for city2, cords2 in cords.items():
D[city1][city2] = dist(cords1, cords2)
for city1, v in D.items():
for city2, d in v.items():
print city1, city2, d
Results:
Boston Boston 0.0
Boston New York 3.16227766017
Boston Phoenix 2.2360679775
New York Boston 3.16227766017
New York New York 0.0
New York Phoenix 2.2360679775
Phoenix Boston 2.2360679775
Phoenix New York 2.2360679775
Phoenix Phoenix 0.0
Upvotes: 8