Reputation: 389
I am new to Python so this question might look trivia. However, I did not find a similar case to mine. I have a matrix of coordinates for 20 nodes. I want to compute the euclidean distance between all pairs of nodes from this set and store them in a pairwise matrix. For example, If I have 20 nodes, I want the end result to be a matrix of (20,20) with values of euclidean distance between each pairs of nodes. I tried to used a for loop to go through each element of the coordinate set and compute euclidean distance as follows:
ncoord=numpy.matrix('3225 318;2387 989;1228 2335;57 1569;2288 8138;3514 2350;7936 314;9888 4683;6901 1834;7515 8231;709 3701;1321 8881;2290 2350;5687 5034;760 9868;2378 7521;9025 5385;4819 5943;2917 9418;3928 9770')
n=20
c=numpy.zeros((n,n))
for i in range(0,n):
for j in range(i+1,n):
c[i][j]=math.sqrt((ncoord[i][0]-ncoord[j][0])**2+(ncoord[i][1]-ncoord[j][1])**2)
How ever, I am getting an error of "input must be a square array ". I wonder if anybody knows what is happening here. Thanks
Upvotes: 12
Views: 17663
Reputation: 74252
There are much, much faster alternatives to using nested for
loops for this. I'll show you two different approaches - the first will be a more general method that will introduce you to broadcasting and vectorization, and the second uses a more convenient scipy library function.
One of the first things I'd suggest doing is switching to using np.array
rather than np.matrix
. Arrays are preferred for a number of reasons, most importantly because they can have >2 dimensions, and they make element-wise multiplication much less awkward.
import numpy as np
ncoord = np.array(ncoord)
With an array, we can eliminate the nested for
loops by inserting a new singleton dimension and broadcasting the subtraction over it:
# indexing with None (or np.newaxis) inserts a new dimension of size 1
print(ncoord[:, :, None].shape)
# (20, 2, 1)
# by making the 'inner' dimensions equal to 1, i.e. (20, 2, 1) - (1, 2, 20),
# the subtraction is 'broadcast' over every pair of rows in ncoord
xydiff = ncoord[:, :, None] - ncoord[:, :, None].T
print(xydiff.shape)
# (20, 2, 20)
This is equivalent to looping over every pair of rows using nested for loops, but much, much faster!
xydiff2 = np.zeros((20, 2, 20), dtype=xydiff.dtype)
for ii in range(20):
for jj in range(20):
for kk in range(2):
xydiff[ii, kk, jj] = ncoords[ii, kk] - ncoords[jj, kk]
# check that these give the same result
print(np.all(xydiff == xydiff2))
# True
The rest we can also do using vectorized operations:
# we square the differences and sum over the 'middle' axis, equivalent to
# computing (x_i - x_j) ** 2 + (y_i - y_j) ** 2
ssdiff = (xydiff * xydiff).sum(1)
# finally we take the square root
D = np.sqrt(ssdiff)
The whole thing could be done in one line like this:
D = np.sqrt(((ncoord[:, :, None] - ncoord[:, :, None].T) ** 2).sum(1))
pdist
It turns out that there's already a fast and convenient function for computing all pairwise distances: scipy.spatial.distance.pdist
.
from scipy.spatial.distance import pdist, squareform
d = pdist(ncoord)
# pdist just returns the upper triangle of the pairwise distance matrix. to get
# the whole (20, 20) array we can use squareform:
print(d.shape)
# (190,)
D2 = squareform(d)
print(D2.shape)
# (20, 20)
# check that the two methods are equivalent
print np.all(D == D2)
# True
Upvotes: 29
Reputation: 1815
Using your own custom sqrt sum sqaures is not always safe, they can overflow or underflow. Speed wise they are same
np.hypot(
np.subtract.outer(x, x),
np.subtract.outer(y, y)
)
i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0
i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf
i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200
i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200
Upvotes: 0
Reputation: 10219
for i in range(0, n):
for j in range(i+1, n):
c[i, j] = math.sqrt((ncoord[i, 0] - ncoord[j, 0])**2
+ (ncoord[i, 1] - ncoord[j, 1])**2)
Note: ncoord[i, j]
is not the same as ncoord[i][j]
for a Numpy matrix. This appears to be the source of confusion. If ncoord
is a Numpy array then they will give the same result.
For a Numpy matrix, ncoord[i]
returns the ith row of ncoord
, which itself is a Numpy matrix object with shape 1 x 2 in your case. Therefore, ncoord[i][j]
actually means: take the ith row of ncoord
and take the jth row of that 1 x 2 matrix. This is where your indexing problems comes about when j
> 0.
Regarding your comments on assigning to c[i][j]
"working", it shouldn't. At least on my build of Numpy 1.9.1 it shouldn't work if your indices i
and j
iterates up to n
.
As an aside, remember to add the transpose of the matrix c
to itself.
It is recommended to use Numpy arrays instead of matrix. See this post.
If your coordinates are stored as a Numpy array, then pairwise distance can be computed as:
from scipy.spatial.distance import pdist
pairwise_distances = pdist(ncoord, metric="euclidean", p=2)
or simply
pairwise_distances = pdist(ncoord)
since the default metric is "euclidean", and default "p" is 2.
In a comment below I mistakenly mentioned that the result of pdist is a n x n matrix. To get a n x n matrix, you will need to do the following:
from scipy.spatial.distance import pdist, squareform
pairwise_distances = squareform(pdist(ncoord))
or
from scipy.spatial.distance import cdist
pairwise_distances = cdist(ncoord, ncoord)
Upvotes: 5
Reputation: 1090
What I figure you wanted to do: You said you wanted a 20 by 20 matrix... but the one you coded is triangular.
Thus I coded a complete 20x20 matrix instead.
distances = []
for i in range(len(ncoord)):
given_i = []
for j in range(len(ncoord)):
d_val = math.sqrt((ncoord[i, 0]-ncoord[j,0])**2+(ncoord[i,1]-ncoord[j,1])**2)
given_i.append(d_val)
distances.append(given_i)
# distances[i][j] = distance from i to j
SciPy way:
from scipy.spatial.distance import cdist
# Isn't scipy nice - can also use pdist... works in the same way but different recall method.
distances = cdist(ncoord, ncoord, 'euclidean')
Upvotes: 1