Reputation: 5019

Vectorize octave/matlab codes

Following is the octave codes(part of kmeans)

centroidSum = zeros(K);
valueSum = zeros(K, n);
for i = 1 : m  
  for j = 1 : K    
    if(idx(i) == j)
      centroidSum(j) = centroidSum(j) + 1;
      valueSum(j, :) = valueSum(j, :) + X(i, :);
    end
  end
end

The codes work, is it possible to vectorize the codes? It is easy to vectorize the codes without if statement, but how could we vectorize the codes with if statement?

Upvotes: 2

Answers (4)

RayD

Reputation: 386

Vectorizing the calculation makes a huge difference in performance. Benchmarking

The original code,
The partial vectorization from A. Donda and
The full vectorization from Tom,

gave me the following results:

Original Code: Elapsed time is 1.327877 seconds.

Partial Vectorization: Elapsed time is 0.630767 seconds.

Full Vectorization: Elapsed time is 0.021129 seconds.

Benchmarking code here:

%// Datasize parameters
K = 5000;
n = 5000;
m = 5000;

idx = randi(9,1,m);
X = rand(m,n);

fprintf('\nOriginal Code: ')
tic
centroidSum1 = zeros(K,1);
valueSum1 = zeros(K, n);
for i = 1 : m  
  for j = 1 : K    
    if(idx(i) == j)
      centroidSum1(j) = centroidSum1(j) + 1;
      valueSum1(j, :) = valueSum1(j, :) + X(i, :);
    end
  end
end
centroids = valueSum1 ./ centroidSum1;
toc, clear valueSum1 centroidSum1 centroids

fprintf('\nPartial Vectorization: ')
tic
centroids = zeros(K,n);
for k = 1:K
    centroids(k,:) = mean( X(idx == k, :) );
end
toc, clear centroids

fprintf('\nFull Vectorization: ')
tic
centroids = zeros(K,n);
b = idx == 1:K;
centroids = (b * X) ./ sum(b)';
toc

Note, I added an extra line to the original code to element-wise divide valueSum1 by centroidSum1 to make the output of each type of code the same.

Finally, I know this isn't strictly an "answer", however I don't have enough reputation to add a comment, and I thought the benchmarking figures were useful to anyone who is learning MATLAB (like myself) and needs some extra motivation to master vectorization.

Upvotes: 0

Tom

Reputation: 11

Not sure about its runtime performance but here's a non-convoluted vectorized implementation:

b = idx == 1:K;
centroids = (b' * X) ./ sum(b)';

Upvotes: 1

Divakar

Reputation: 221684

Brief introduction and solution code

This could be one fully vectorized approach based on -

accumarray: For accumulating summations as done for calulating valueSum. This also introduces a technique how one can use accumarray on a 2D matrix along a certain direction, which isn't possible in a straight-forward manner with it.
bsxfun: For calculating linear indices across all columns for matching row indices from idx.

Here's the implementation -

%// Store no. of columns in X for frequent usage later on
ncols = size(X,2); 

%// Find indices in idx that are within [1:k] range, call them as labels
%// Also, find their locations in that range array, call those as pos
[pos,id] = ismember(idx,1:K);
labels = id(pos);
%// OR with bsxfun: [pos,labels] = find(bsxfun(@eq,idx(:),1:K));

%// Find all labels, i.e. across all columns of X
all_labels = bsxfun(@plus,labels(:),[0:ncols-1]*K);

%// Get truncated X corresponding to all indices matches across all columns
X_cut = X(pos,:);

%// Accumulate summations within each column based on the labels.
%// Note that accumarray doesn't accept matrices, so we were required
%// to create all_labels that had same labels within each column and
%// offsetted at constant intervals from consecutive columns
acc1 = accumarray(all_labels(:),X_cut(:));

%// Regularise accumulated array and reshape back to a 2D array version
acc1_reg2D = [acc1 ; zeros(K*ncols - numel(acc1),1)]; 
valueSum = reshape(acc1_reg2D,[],ncols);
centroidSum = histc(labels,1:K); %// Get labels counts as centroid sums

Benchmarking code

%// Datasize parameters
K = 5000;
n = 5000;
m = 5000;

idx = randi(9,1,m);
X = rand(m,n);

disp('----------------------------- With Original Approach')
tic
centroidSum1 = zeros(K,1);
valueSum1 = zeros(K, n);
for i = 1 : m  
  for j = 1 : K    
    if(idx(i) == j)
      centroidSum1(j) = centroidSum1(j) + 1;
      valueSum1(j, :) = valueSum1(j, :) + X(i, :);
    end
  end
end
toc, clear valueSum1 centroidSum1

disp('----------------------------- With Proposed Approach')
tic
%// ... Code from earlied mentioned section
toc

Runtime results

----------------------------- With Original Approach
Elapsed time is 1.235412 seconds.
----------------------------- With Proposed Approach
Elapsed time is 0.379133 seconds.

Upvotes: 2

A. Donda

Reputation: 8476

I assume the purpose of the code is to compute the centroids of subsets of a set of m data points in an n-dimensional space, where the points are stored in a matrix X (points x coordinates) and the vector idx specifies for each data point the subset (1 ... K) the point belongs to. Then a partial vectorization is:

centroid = zeros(K, n)
for j = 1 : K
   centroid(j, :) = mean(X(idx == j, :));
end

The if is eliminated by indexing, in particular logical indexing: idx == j gives a boolean array which indicates which data points belong to subset j.

I think it might be possible to get rid of the second for-loop, too, but this would result in very convoluted, unintelligible code.

Upvotes: 6

Vectorize octave/matlab codes

Answers (4)

Related Questions