Curlew

Reputation: 1034

Calculate average weighted euclidean distance between values in numpy

I searched around a bit and found comparable questions and answers, but none of them gave me the correct results.

Situation: I have an array containing several clumps of cells with value 1; all other cells are set to zero. Each cell is a square (width = height). I want to calculate the average distance between all cells with value 1. The formula should look like this: d = sqrt( ((x2 - x1)*size)**2 + ((y2 - y1)*size)**2 )

Example:

import numpy as np
from scipy.spatial.distance import pdist

a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])

# Given that each cell is 10m wide/high
val = 10
d = pdist(a, lambda u, v: np.sqrt( ( ((u-v)*val)**2).sum() ) )
d
array([ 14.14213562,  10.        ,  10.        ])

After that I would calculate the average via d.mean(). However, the result in d is obviously wrong, as the distance between the two marked cells in the top row should already be 20 (two cells apart * 10). Is there something wrong with my formula, my math, or my approach?

Upvotes: 2

Views: 2517

Answers (1)

Oliver W.

Reputation: 13459

To compute the distances between the non-zero markers, you first need their actual coordinates:

>>> import numpy as np
>>> from scipy.spatial.distance import squareform, pdist
>>> a = np.array([[1, 0, 1],
...               [0, 0, 0],
...               [0, 0, 1]])
>>> np.where(a)
(array([0, 0, 2]), array([0, 2, 2]))
>>> x,y = np.where(a)
>>> coords = np.vstack((x,y)).T
>>> coords
array([[0, 0],   # That's the coordinate of the "1" in the top left,
       [0, 2],   # top right,
       [2, 2]])  # and bottom right.

Next you want to calculate the distance between these points. You use pdist for this, like so:

>>> dists = pdist(coords) * 10  # Uses the Euclidean distance metric by default.
>>> squareform(dists)
array([[  0.        ,  20.        ,  28.28427125],
       [ 20.        ,   0.        ,  20.        ],
       [ 28.28427125,  20.        ,   0.        ]])

In this last matrix you will find, above the diagonal, the distance between each pair of marked points in a. In this case there are 3 coordinates, so it gives you the distance between node 0 (a[0,0]) and node 1 (a[0,2]), between node 0 and node 2 (a[2,2]), and finally between node 1 and node 2. In other words, if S = squareform(dists), then S[i,j] is the distance between the coordinates on row i and row j of coords.
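As a small sketch of that correspondence, the snippet below pulls the upper triangle out of the square matrix and compares it with the condensed vector that pdist returns. The names pairs and upper are introduced here only for illustration, and the factor 10 is the cell size from the question.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Coordinates of the three marked cells, as (row, column) pairs.
coords = np.array([[0, 0], [0, 2], [2, 2]])

dists = pdist(coords) * 10   # condensed vector, one entry per pair
S = squareform(dists)        # full symmetric matrix

# Index pairs of the strict upper triangle, row by row.
i, j = np.triu_indices(len(coords), k=1)
pairs = list(zip(i.tolist(), j.tolist()))   # (0, 1), (0, 2), (1, 2)

# Reading S at those index pairs reproduces the condensed vector.
upper = S[i, j]
print(pairs)
print(np.allclose(upper, dists))
```

So the k-th entry of dists belongs to the k-th pair in pairs, which is handy if you need to know which two cells a given distance refers to.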

The values in the upper triangle of that last matrix are exactly the values already stored in the variable dists, so you can compute the mean directly from dists without the relatively expensive squareform call (shown here only for demonstration purposes):

>>> dists
array([ 20.        ,  28.28427125,  20.        ])
>>> dists.mean()
22.761423749153966
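Putting the steps together, a minimal sketch of a helper could look like the following. The function name mean_pairwise_distance and its cell_size parameter are made up for this example, and np.argwhere is used as a shorthand for the np.where plus np.vstack combination above. Returning NaN when fewer than two cells are marked is an assumption about what you would want in that case.

```python
import numpy as np
from scipy.spatial.distance import pdist


def mean_pairwise_distance(grid, cell_size=1.0):
    """Mean Euclidean distance between all non-zero cells of a 2-D grid.

    Hypothetical helper: cell_size is the edge length of one square cell.
    """
    # (row, column) index of every non-zero cell, one cell per row.
    coords = np.argwhere(grid).astype(float)
    if len(coords) < 2:
        # No pair exists, so the mean is undefined (assumed behaviour).
        return float("nan")
    # pdist works in cell units; scale to real-world units afterwards.
    return float(pdist(coords).mean() * cell_size)


a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])

result = mean_pairwise_distance(a, cell_size=10)
print(result)
```

Keep in mind that the number of pairs grows quadratically with the number of marked cells, so for large clumps the condensed distance vector can become quite big.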

Note that your computed solution only "looks" nearly correct (aside from a factor of 2) because of the example you chose. pdist takes the Euclidean distance between the first and the second point in n-dimensional space, then between the first and the third, and so on. In your example, it treats each row of a as one point in 3-dimensional space: the first point (row 0) has coordinates [1,0,1] and the second point (row 1) is [0,0,0]. The Euclidean distance between those two is sqrt(2) ≈ 1.41. The distance between the first and the third point (the last row of a, [0,0,1]) is only 1. Finally, the distance between the second point (row 1: [0,0,0]) and the third (row 2: [0,0,1]) is also 1. Your lambda multiplies each of these by val = 10, which gives the 14.14, 10 and 10 you saw. So remember: pdist interprets its first argument as a stack of coordinates in n-dimensional space, where n is the number of elements in each row.
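To make the difference concrete, here is a short side-by-side comparison. It is only an illustration: row_dists and cell_dists are names chosen for this example, and np.argwhere stands in for the coordinate extraction shown earlier.

```python
import numpy as np
from scipy.spatial.distance import pdist

a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])
val = 10  # cell size from the question

# Passing the grid itself: each ROW is treated as one point in 3-D space.
row_dists = pdist(a) * val

# Passing the coordinates of the marked cells: each cell is one 2-D point.
coords = np.argwhere(a)
cell_dists = pdist(coords) * val

print(row_dists)   # distances between rows of the grid
print(cell_dists)  # distances between marked cells
```

The first result reproduces the numbers from the question, the second the distances computed in this answer.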

Upvotes: 4
