Reputation: 1034
I searched around a bit and found comparable questions/answers, but none of them returned the correct results for me.
Situation:
I have an array with a number of clumps of values == 1, while the rest of the cells are set to zero. Each cell is a square (width=height).
Now I want to calculate the average distance between all 1 values.
The formula should be like this: d = sqrt ( (( x2 - x1 )*size)**2 + (( y2 - y1 )*size)**2 )
Example:
import numpy as np
from scipy.spatial.distance import pdist
a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])
# Given that each cell is 10m wide/high
val = 10
d = pdist(a, lambda u, v: np.sqrt( ( ((u-v)*val)**2).sum() ) )
d
array([ 14.14213562, 10. , 10. ])
After that I would calculate the average via d.mean(). However, the result in d is obviously wrong: the distance between the two cells in the top row should already be 20 (two cells apart * 10). Is there something wrong with my formula, math or approach?
Upvotes: 2
Views: 2517
Reputation: 13459
You need the actual coordinates of the non-zero markers, to compute the distance between them:
>>> import numpy as np
>>> from scipy.spatial.distance import squareform, pdist
>>> a = np.array([[1, 0, 1],
... [0, 0, 0],
... [0, 0, 1]])
>>> np.where(a)
(array([0, 0, 2]), array([0, 2, 2]))
>>> x,y = np.where(a)
>>> coords = np.vstack((x,y)).T
>>> coords
array([[0, 0], # That's the coordinate of the "1" in the top left,
[0, 2], # top right,
[2, 2]]) # and bottom right.
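As a side note, the same coordinate array can probably be obtained in one step with np.argwhere. This is just a sketch of an alternative, not something the steps above depend on:

```python
import numpy as np

a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])

# np.argwhere returns one (row, column) pair per non-zero cell,
# i.e. the same array as np.vstack(np.where(a)).T
coords = np.argwhere(a)
print(coords)
```

Keep in mind that np.where returns the row indices first, so the variable called x above is really the row index. With square cells that makes no difference to the distances.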
Next you want to calculate the distance between these points. You use pdist for this, like so:
>>> dists = pdist(coords) * 10 # Uses the Euclidean distance metric by default.
>>> squareform(dists)
array([[ 0. , 20. , 28.28427125],
[ 20. , 0. , 20. ],
[ 28.28427125, 20. , 0. ]])
In this last matrix you will find, above the diagonal, the distance between each pair of marked points in a. In this case you had 3 coordinates, so it gives you the distance between node 0 (a[0,0]) and node 1 (a[0,2]), node 0 and node 2 (a[2,2]), and finally between node 1 and node 2. To put it differently, if S = squareform(dists), then S[i,j] returns the distance between the coordinates on row i of coords and row j.
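To make the relation between the condensed pdist output and the square matrix concrete, here is a small check. The names S and iu are only used for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

coords = np.array([[0, 0], [0, 2], [2, 2]])
dists = pdist(coords) * 10   # condensed form, pairs in the order (0,1), (0,2), (1,2)
S = squareform(dists)        # full symmetric matrix with zeros on the diagonal

# S[i, j] is the distance between row i and row j of coords
assert np.isclose(S[0, 2], 10 * np.sqrt(8))

# The upper triangle (diagonal excluded) holds the same values as the condensed form
iu = np.triu_indices(len(coords), k=1)
assert np.allclose(S[iu], dists)
```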
The values in the upper triangle of that last matrix are exactly the ones already present in the variable dists, from which you can derive the mean easily, without having to perform the relatively expensive squareform calculation (shown here just for demonstration purposes):
>>> dists
array([ 20. , 28.2842712, 20. ])
>>> dists.mean()
22.761423749153966
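If you need this more than once, the steps could be wrapped in a small helper. This is only a sketch: the function name mean_marker_distance and the cell_size parameter are made up here, and returning NaN for fewer than two markers is an assumption about what you would want in that case.

```python
import numpy as np
from scipy.spatial.distance import pdist

def mean_marker_distance(grid, cell_size=1.0):
    """Mean Euclidean distance between all pairs of non-zero cells in grid."""
    coords = np.argwhere(grid)        # one (row, column) pair per non-zero cell
    if len(coords) < 2:
        return float("nan")           # no pairs, so no distance to average
    # pdist works in cell units; scale the result to real-world units
    return float(pdist(coords).mean() * cell_size)

a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])

result = mean_marker_distance(a, cell_size=10)
print(result)
```

Note that pdist produces n*(n-1)/2 distances for n marked cells, so for large clumps the memory use grows quickly.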
Note that your computed solution "looks" nearly correct (aside from a factor of 2) only because of the example you chose. What pdist does is take the Euclidean distance between the first point in n-dimensional space and the second, then between the first and the third, and so on. In your example, that means it computes the distances between the rows of a: the point on row 0 has coordinates in 3-dimensional space given by [1,0,1]. The 2nd point is [0,0,0]. The Euclidean distance between those two is sqrt(2)~1.4. Then the distance between the first and the 3rd coordinate (the last row in a) is only 1. Finally, the distance between the 2nd coordinate (row 1: [0,0,0]) and the 3rd (last row, row 2: [0,0,1]) is also 1. Multiplied by your cell size of 10, these give the 14.14, 10 and 10 you saw. So remember, pdist interprets its first argument as a stack of coordinates in n-dimensional space, n being the number of elements in each row.
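You can check this interpretation yourself by passing a directly to pdist with the default Euclidean metric. As far as I can tell this reproduces the numbers from the question, since your lambda computes the same thing with the factor 10 applied inside:

```python
import numpy as np
from scipy.spatial.distance import pdist

a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])

# Each ROW of a is treated as one point in 3-dimensional space,
# so these are distances between rows, not between the marked cells.
row_dists = pdist(a) * 10
print(row_dists)  # approximately [14.14, 10., 10.]
```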
Upvotes: 4