Reputation:
I have a set of n-dimensional representative vectors in MATLAB. I need to assign each vector in a set of training vectors to the group of its nearest representative vector. How should I do this?
Upvotes: 0
Views: 3361
Reputation: 45762
If by n-dimensional vector you mean an ordered list of n-dimensional points (that's my understanding of what you want), then I have done this in the past using the mean closest distance. For each point on vector 1, find the smallest distance to any point on vector 2. The directed distance from vector 1 to vector 2 is then the mean of all these smallest distances. This is not symmetric, so you should repeat the process for each point on vector 2, finding its smallest distance to vector 1, and then aggregate the two means with a min, max or mean.
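Written out as a formula, this looks as follows. The symbols are introduced here only for illustration and do not appear in the code: $A$ and $B$ are the two point lists, $\lVert\cdot\rVert$ is whatever point-to-point distance you choose, and $g$ is the aggregation step.

```latex
d(A \to B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert ,
\qquad
d_{\mathrm{sym}}(A, B) = g\bigl( d(A \to B),\; d(B \to A) \bigr),
\quad g \in \{\min,\ \max,\ \operatorname{mean}\}
```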
Here is some code I wrote (for lists of 3-D points) using loops:
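function mcd = MCD(fiber1, fiber2, option)
%MCD(fiber1, fiber2)          directed mean closest distance, fiber1 -> fiber2
%MCD(fiber1, fiber2, option)  symmetric version, option is 'mean' or 'min'
%Each fiber is an N-by-3 matrix with one point per row.

%remove NaNs: drops every row from the first NaN onwards (trailing padding)
fiber1(find(isnan(fiber1),1):size(fiber1,1),:) = [];
fiber2(find(isnan(fiber2),1):size(fiber2,1),:) = [];

%directed distance fiber1 -> fiber2
dist = 0;
for k = 1:size(fiber1,1)
    D = zeros(1,size(fiber2,1)); %preallocate instead of growing in the loop
    for j = 1:size(fiber2,1)
        D(j) = distance(fiber1(k,:),fiber2(j,:));
    end
    dist = dist + min(D);
end
mcd = dist / size(fiber1,1);

if nargin > 2
    %directed distance fiber2 -> fiber1
    dist = 0;
    for k = 1:size(fiber2,1)
        D = zeros(1,size(fiber1,1));
        for j = 1:size(fiber1,1)
            D(j) = distance(fiber2(k,:),fiber1(j,:));
        end
        dist = dist + min(D);
    end
    mcd2 = dist / size(fiber2,1);

    %aggregate the two directed distances
    if strcmp(option,'mean')
        mcd = mean([mcd mcd2]);
    elseif strcmp(option,'min')
        mcd = min([mcd mcd2]);
    end
end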
but this was far too slow for me. So here is a vectorised (and harder to follow) version that is much faster:
function mcd = MCD(fiber1, fiber2, option, sampling)
%MCD(fiber1, fiber2)
%MCD(fiber1, fiber2, option)
%MCD(fiber1, fiber2, option, sampling)

%remove NaNs: drops every row from the first NaN onwards (trailing padding)
fiber1(find(isnan(fiber1),1):size(fiber1,1),:) = [];
fiber2(find(isnan(fiber2),1):size(fiber2,1),:) = [];

%sample the fibers for speed. Each fiber is represented by roughly
%"sampling" points.
if nargin == 4
    freq = max(1, round(size(fiber1,1)/sampling)); %step must be at least 1
    fiber1 = fiber1(1:freq:end,:);
    freq = max(1, round(size(fiber2,1)/sampling));
    fiber2 = fiber2(1:freq:end,:);
end

%number of points per fiber. size(...,1) is used rather than length(),
%which returns the largest dimension and is wrong for fewer than 3 points.
F1 = size(fiber1,1);
F2 = size(fiber2,1);

%reshape to optimize the use of distance() for speed
FIBER2 = reshape(fiber2',[1,3,F2]);
FIBER1 = reshape(fiber1',[1,3,F1]); %this is only used in the symmetrical case, i.e. when the 'min' or 'mean' option is called

%reshape and tile fiber 1 so as to eliminate the need for two nested for
%loops, thus greatly increasing the computational efficiency. The goal is to
%have a 4D matrix with 1 row and 3 columns. Dimension 3 is a smearing of
%these columns to be as long as fiber2 so that each vector (1x3) in fiber1
%can be placed "on top" as in a row above the whole of fiber2. Thus dim 3
%is as long as fiber2 and dim 4 is as long as fiber1.
fiber1 = repmat(FIBER1,[F2,1,1]);      %F2x3xF1
fiber1 = permute(fiber1,[2,1,3]);      %3xF2xF1
fiber1 = reshape(fiber1,[1,3,F2,F1]);  %1x3xF2xF1
%min over dim 3 (points of fiber2), then mean over dim 4 (points of fiber1)
mcd = mean(min(distance(fiber1, repmat(FIBER2,[1,1,1,F1])),[],3),4);

if nargin > 2
    %same construction with the roles of the two fibers swapped
    fiber2 = repmat(FIBER2,[F1,1,1]);      %F1x3xF2
    fiber2 = permute(fiber2,[2,1,3]);      %3xF1xF2
    fiber2 = reshape(fiber2,[1,3,F1,F2]);  %1x3xF1xF2
    mcd2 = mean(min(distance(fiber2, repmat(FIBER1,[1,1,1,F2])),[],3),4);
    if strcmp(option,'mean')
        mcd = mean([mcd mcd2]);
    elseif strcmp(option,'min')
        mcd = min([mcd mcd2]);
    end
end
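For comparison, the same quantity can also be computed from an explicit pairwise distance matrix, which some readers may find easier to follow than the 4-D reshape. This is only a minimal sketch, not the code I used: the example points `A` and `B` are made up, it hard-codes Euclidean distance rather than calling distance(), and it skips the NaN removal and sampling steps.

```matlab
% Two small example curves, one 3-D point per row (made-up data).
A = [0 0 0; 1 0 0; 2 0 0];
B = [0 1 0; 1 1 0];

% Pairwise differences: A is viewed as F1x1x3 and B as 1xF2x3, and
% bsxfun expands both to F1xF2x3.
delta = bsxfun(@minus, permute(A, [1 3 2]), permute(B, [3 1 2]));
D = sqrt(sum(delta.^2, 3));   % D(i,j) = Euclidean distance from A(i,:) to B(j,:)

dAB = mean(min(D, [], 2));    % each point of A to its closest point of B
dBA = mean(min(D, [], 1));    % each point of B to its closest point of A

mcdMean = mean([dAB dBA]);    % corresponds to the 'mean' option
mcdMin  = min([dAB dBA]);     % corresponds to the 'min' option
```

The matrix `D` needs F1 x F2 entries, so for long curves the sampling step from the function above is still worth applying first.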
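This is the distance() function used by both MCD versions. In my case I used Euclidean distance, but you can adapt it to whatever is best for you, so long as it accepts two row vectors and, for the vectorised version, two equally sized arrays with the coordinates along dimension 2: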
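function Edist = distance(vector1,vector2)
%distance(vector1,vector2)
%
%provides the Euclidean distance between two input vectors. Vector1 and
%vector2 must be row vectors of the same length. The number of elements in
%each vector is the dimensionality thereof. Equally sized arrays with a
%single row (e.g. 1x3xF2xF1) also work: the distance is taken along dim 2.
Edist = sqrt(sum(diff([vector1;vector2],1,1).^2,2)); %stack along dim 1, difference, sum squares along dim 2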
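Once you have a distance between two point lists, the grouping asked about in the question comes down to assigning each training item to the representative with the smallest distance. The following is a rough sketch under my own assumptions: the cell arrays `reps` and `train` and their contents are hypothetical, and the anonymous functions `directed` and `mcd` are a compact stand-in for the MCD function with the 'mean' option, written inline so the snippet runs on its own.

```matlab
% Hypothetical data: two representatives and three training curves,
% each stored as a matrix with one 3-D point per row.
reps  = { [0 0 0; 1 0 0; 2 0 0], [0 5 0; 1 5 0; 2 5 0] };
train = { [0 0.2 0; 1 0.1 0], [0 4.8 0; 2 5.1 0; 1 5 0], [0 1 0; 2 1 0] };

% Directed mean closest distance from A to B (Euclidean).
directed = @(A, B) mean(min(sqrt(sum(bsxfun(@minus, ...
    permute(A, [1 3 2]), permute(B, [3 1 2])).^2, 3)), [], 2));
% Symmetric version using the 'mean' aggregation.
mcd = @(A, B) (directed(A, B) + directed(B, A)) / 2;

% Assign each training curve to its nearest representative.
labels = zeros(1, numel(train));
for t = 1:numel(train)
    d = zeros(1, numel(reps));
    for r = 1:numel(reps)
        d(r) = mcd(train{t}, reps{r});
    end
    [~, labels(t)] = min(d);   % index of the closest representative
end
```

Here `labels(t)` is the index into `reps` of the group that training curve `t` falls into. If your vectors are single points rather than point lists, the same loop works with one-row matrices.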
Upvotes: 1