Reputation:
I have 2 arrays in 2D, where the column vectors are feature vectors. One array is of size F x A, the other of F x B, where A << B. As an example, for A = 2 and F = 3 (B can be anything):
arr1 = np.array( [[1, 4],
[2, 5],
[3, 6]] )
arr2 = np.array( [[1, 4, 7, 10, ..],
[2, 5, 8, 11, ..],
[3, 6, 9, 12, ..]] )
I want to calculate the distance between arr1
and a fragment of arr2
that is of equal size (in this case, 3x2), for each possible fragment of arr2
. The column vectors are independent of each other, so I believe I should calculate the distance between each column vector in arr1
and a collection of column vectors ranging from i
to i + A
from arr2
and take the sum of these distances (not sure though).
Does numpy offer an efficient way of doing this, or will I have to take slices from the second array and, using another loop, calculate the distance between each column vector in arr1
and the corresponding column vector in the slice?
Example for clarity, using the arrays stated above:
>>> magical_distance_func(arr1, arr2[:,:2])
[0, 10.3923..]
>>> # First, distance between arr2[:,:2] and arr1, which equals 0.
>>> # Second, distance between arr2[:,1:3] and arr1, which equals
>>> diff = arr1 - np.array( [[4,7],[5,8],[6,9]] )
>>> diff
[[-3, -3], [-3, -3], [-3, -3]]
>>> # this happens to consist only of -3's. Norm of each column vector is:
>>> norm1 = np.linalg.norm([:,0])
>>> norm2 = np.linalg.norm([:,1])
>>> # would be extremely good if this worked for an arbitrary number of norms
>>> totaldist = norm1 + norm2
>>> totaldist
10.3923...
Of course, transposing the arrays is fine too, if that means that cdist can somehow be used here.
Upvotes: 8
Views: 6202
Reputation: 150947
If I understand your question correctly, this will work. Knowing numpy
, there's probably a better way, but this is at least fairly straightforward. I used some contrived coordinates to show that the calculation is working as expected.
>>> arr1
array([[0, 3],
[1, 4],
[2, 5]])
>>> arr2
array([[ 3, 6, 5, 8],
[ 5, 8, 13, 16],
[ 2, 5, 2, 5]])
You can subtract arr1
from arr2
by ensuring that they broadcast against each other correctly. The best way I could think of involves taking a transpose and doing some reshaping. These don't create copies -- they create views -- so this isn't so wasteful. (dist
is a copy though.)
>>> dist = (arr2.T.reshape((2, 2, 3)) - arr1.T).reshape((4, 3))
>>> dist
array([[ 3, 4, 0],
[ 3, 4, 0],
[ 5, 12, 0],
[ 5, 12, 0]])
Now all we have to do is apply numpy.linalg.norm
across axis 1. (You can select from among several norms).
>>> numpy.apply_along_axis(numpy.linalg.norm, 1, dist)
array([ 5., 5., 13., 13.])
Assuming you want simple euclidean distance, you can also do it directly; not sure whether this will be faster or slower so try both:
>>> (dist ** 2).sum(axis=1) ** 0.5
array([ 5., 5., 13., 13.])
Based on your edit, we have to do only one small tweak. Since you want to test the columns pairwise, rather than blockwise, you need a rolling window. This can be done very simply with fairly straightforward indexing:
>>> arr2.T[numpy.array(zip(range(0, 3), range(1, 4)))]
array([[[ 3, 5, 2],
[ 6, 8, 5]],
[[ 6, 8, 5],
[ 5, 13, 2]],
[[ 5, 13, 2],
[ 8, 16, 5]]])
Combining that with the other tricks:
>>> arr2_pairs = arr2.T[numpy.array(zip(range(0, 3), range(1, 4)))]
>>> dist = arr2_pairs - arr1.T
>>> (dist ** 2).sum(axis=2) ** 0.5
array([[ 5. , 5. ],
[ 9.69535971, 9.69535971],
[ 13. , 13. ]])
However, converting arrays from list comprehensions tends to be slow. It might be faster to use stride_tricks -- here again, see which one suits your purposes best:
>>> as_strided(arr2.T, strides=(8, 8, 32), shape=(3, 2, 3))
array([[[ 3, 5, 2],
[ 6, 8, 5]],
[[ 6, 8, 5],
[ 5, 13, 2]],
[[ 5, 13, 2],
[ 8, 16, 5]]])
This actually manipulates the way numpy
moves over a block of memory, allowing a small array to emulate a bigger array.
>>> arr2_pairs = as_strided(arr2.T, strides=(8, 8, 32), shape=(3, 2, 3))
>>> dist = arr2_pairs - arr1.T
>>> (dist ** 2).sum(axis=2) ** 0.5
array([[ 5. , 5. ],
[ 9.69535971, 9.69535971],
[ 13. , 13. ]])
So now you have a simple 2-d array corresponding to distances for each pair of columns. Now it's just a matter of getting the mean
and calling argmin
.
>>> normed = (dist ** 2).sum(axis=2) ** 0.5
>>> normed.mean(axis=1)
array([ 5. , 9.69535971, 13. ])
>>> min_window = normed.mean(axis=1).argmin()
>>> arr2[:,[min_window, min_window + 1]]
array([[3, 6],
[5, 8],
[2, 5]])
Upvotes: 5
Reputation: 5619
You can get the distance matrix using cdist from scipy.spatial.distance. Once you have the distance matrix, you can just sum across columns and normalize to get the average distance, if that's what you're looking for.
Note: Instead of columns, cdist uses rows to compute the pairwise distances.
Here you have an example using the 'cosine' distance:
from scipy.spatial.distance import cdist
arr1 = np.array( [[1, 7],
[4, 8],
[4, 0]] )
arr2 = array( [[1, 9, 3, 6, 2],
[3, 9, 0, 2, 3],
[6, 0, 2, 7, 4]] )
# distance matrix
D = cdist( arr1.transpose(), arr2.transpose(), 'cosine' )
# average distance array (each position corresponds to each column of arr1)
d1 = D.mean( axis=1 )
# average distance array (each position corresponds to each column of arr2)
d2 = D.mean( axis=0 )
# Results
d1 = array([ 0.23180963, 0.35643282])
d2 = array([ 0.31018485, 0.19337869, 0.46050302, 0.3233269 , 0.18321265])
There are many distances available. Check out the documentation.
Upvotes: 3