HCAI
HCAI

Reputation: 2263

Matlab: Euclidean norm (or difference) between two vectors

I'd like to calculate the Euclidean distance between a vector G and each row of an array C, while dividing each row by a value in a vector GSD. What I've done seems very inefficient. What's my biggest overhead? Could I speed it up?

m=1E7;
G=1E5*rand(1,8);
C=1E5*[zeros(m,1),rand(m,8)]; 
GSD=10*rand(1,8);

%I've taken the log10 of the values because G and C are very large in magnitude. 
%Don't know if it's worth it.

for i=1:m
    dG(i,1)=norm((log10(G)-log10(C(i,2:end)))/log10(GSD));
end

Using the examples from below, they don't all give the same answer. In fact none of them give the same answer (see following figure using:

dG = pdist2(log10(G),log10(C(:,2:end)),'mahalanobis',diag(log10(GSD))); %(1)

dG = sqrt(sum((log10(G)-log10(C(:,2:end))./log10(GSD)).^2,2)); 

tmp=bsxfun(@rdivide,bsxfun(@minus,log10(G),log10(C(:,2:end))),log10(GSD)); %(4)
dG = sqrt(sum(tmp.^2,2));

enter image description here

Upvotes: 0

Views: 694

Answers (2)

Nicky Mattsson
Nicky Mattsson

Reputation: 3052

You can use pdist2(x,y) to calculate the pairwise distance between all elements in x and y, thus your example would be something like

dG = pdist2(log10(G),log10(C(:,2:end)),'mahalanobis',diag(log10(GSD)).^2);

where the name-pair 'mahalanobis',diag(log10(GSD)).^2 puts log10(GSD) as weights on the Eucledean, which is the known as the Mahalanobis distance.

Note that the Mahalanobis distance is originally intented for normalising data, thus it is the "covariance" which have to be put as the fourth input, which MATLAB then finds the Cholesky decomposition of (element-wise squareroot when diagonal, as here).

Implicit expansion

In newer MATLAB editions, one can also just just the implcit expansion as the first entry is only 1 vector.

dG = sqrt(sum(((log10(G)-log10(C(:,2:9)))./log10(GSD)).^2,2));

which is probably a tad faster, I do, however, prefer the pdist2 solution as I find it clearer.

Upvotes: 1

Brice
Brice

Reputation: 1580

The floating point should handle the large magnitude of the input data, up to a certain point with float data and with any reasonable value with double data

realmax('single')
ans =
  3.4028e+38

realmax('double')
ans =
  1.7977e+308

With 1e7 values in the +/- 1e5 range, you may expect the square of the Euclidian distance to be in the +/- 1e17 range (5+5+7), which both formats will handle with ease.

In any case, you should vectorize the code to remove the loop (which Matlab has a history of handling very inefficiently, especially in older versions)

With new versions (2016b and later), simply use:

tmp=(log10(G)-log10(C(:,2:end)))./log10(GSD);
dG = sqrt(sum(tmp.^2,2)); %row-by-row norm

Note that you have to use ./ which is a element-wise division, not / which is matrix right division.

The following code will work everywhere

tmp=bsxfun(@rdivide,bsxfun(@minus,log10(G),log10(C(:,2:end))),log10(GSD));
dG = sqrt(sum(tmp.^2,2)); %row-by-row norm

I however believe that the use of log10 is a mathematical error. The result dG will not be the Euclidian norm. You should stick with the root mean square of the weighted difference:

dG = sqrt(sum(bsxfun(@rdivide,bsxfun(@minus,G,C(:,2:end)),GSD).^2,2)); % all versions
dG = sqrt(sum((G-C(:,2:end)./GSD).^2,2)); %R2016b and later

Upvotes: 1

Related Questions