Harkan
Harkan

Reputation: 192

Vectorizing distance to several points on Octave (Matlab)

I'm writing a k-means algorithm. At each step, I want to compute the distance of my n points to k centroids, without a for loop, and for d dimensions.

The problem is I have a hard time splitting on my number of dimensions with the Matlab functions I know. Here is my current code, with x being my n 2D-points and y my k centroids (also 2D-points of course), and with the points distributed along dimension 1, and the spatial coordinates along the dimension 2:

  dist = @(a,b) (a - b).^2;

  dx = bsxfun(dist, x(:,1), y(:,1)'); % x is (n,1) and y is (1,k)
  dy = bsxfun(dist, x(:,2), y(:,2)'); % so the result is (n,k)
  
  dists = dx + dy; % contains the square distance of each points to the k centroids
  [_,l] = min(dists, [], 2); % we then argmin on the 2nd dimension

How to vectorize furthermore ?

First edit 3 days later, searching on my own

Since asking this question I made progress on my own towards vectorizing this piece of code. The code above runs in approximately 0.7 ms on my example.

I first used repmat to make it easy to do broadcasting:

dists = permute(permute(repmat(x,1,1,k), [3,2,1]) - y, [3,2,1]).^2;
dists = sum(dists, 2);
[~,l] = min(dists, [], 3);

As expected it is slightly slower since we replicate the matrix, it runs at 0.85 ms.

From this example it was pretty easy to use bsxfun for the whole thing, but it turned out to be extremely slow, running in 150 ms so more than 150 times slower than the repmat version:

dist = @(a, b) (a - b).^2;
dists = permute(bsxfun(dist, permute(x, [3, 2, 1]), y), [3, 2, 1]);
dists = sum(dists, 2);
[~,l] = min(dists, [], 3);

Why is it so slow ? Isn't vectorizing always an improvement on speed, since it uses vector instructions on the CPU ? I mean of course simple for loops could be optimized to use it aswell, but how can vectorizing make the code slower ? Did I do it wrong ?

Using a for loop

For the sake of completeness, here's the for loop version of my code, surprisingly the fastest running in 0.4 ms, not sure why..

for i=1:k
  dists(:,i) = sum((x - y(i,:)).^2, 2);
endfor
[~,l] = min(dists, [], 2);

Upvotes: 2

Views: 517

Answers (1)

pho
pho

Reputation: 25489

Note: This answer was written when the question was also tagged MATLAB. Links to Octave documentation added after the MATLAB tag was removed.


You can use the pdist2MATLAB/Octave function to calculate pairwise distances between two sets of observations. This way, you offload the bother of vectorization to the people who wrote MATLAB/Octave (and they have done a pretty good job of it)

X = rand(10,3);
Y = rand(5,3);

D = pdist2(X, Y);

D is now a 10x5 matrix where the i, jth element is the distance between the ith X and jth Y point.

You can pass it the kind of distance you want as the third argument -- e.g. 'euclidean', 'minkowski', etc, or you could pass a function handle to your custom function like so:

dist = @(a,b) (a - b).^2;
D = pdist2(X, Y, dist);

As saastn mentions, pdist2(..., 'smallest', k) makes things easier in k-means. This returns just the smallest k values from each column of pdist2's result. Octave doesn't have this functionality, but it's easily replicated using sort()MATLAB/Octave.

D_smallest = sort(D);
D_smallest = D_smallest(1:k, :);

Upvotes: 4

Related Questions