matlabperformancematrixmultidimensional-arrayleast-squares

Reputation: 127

Improving the performance of many sub-matrix left division operations (mldivide, \)

I have two matrices, a1 and a2. a1 is 3x12000 and a2 is 3x4000. I want to create another array that is 3x4000 which is the left matrix division (mldivide, \) of the 3x3 sub-matrices of a1 and the 3x1 sub-matrices of a2. You can do this easily with a for loop:

for ii = 1:3:12000
  a = a1(:,ii:ii+2)\a2(:, ceil(ii/3));
end

However, I was wondering if there was a faster way to do this.

Edit: I am aware preallocation increases speed, I was just showing that for visual purposes.

Edit2: Removed iterative increase of array. It seems my questions has been misinterpreted a bit. I was mainly wondering if there were some matrix operations I could do to achieve my goal as that would likely be quicker than a for loop i.e. reshape a1 to a 3x3x4000 matrix and a2 to a 3x1x4000 matrix and left matrix divide each level in one go, however, you can't left matrix divide with 3D matrices.

Upvotes: 2

Answers (3)

Eliahu Aaron

Reputation: 4562

You can create one system of equations containing many independent 'sub-systems' of equations by putting the sub-matrices of a1 in a the diagonal of a 12000x12000 matrix like this:

a1(1,1) a1(1,2) a1(1,3)    0       0      0        0       0      0       
a1(2,1) a1(2,2) a1(2,3)    0       0      0        0       0      0       
a1(3,1) a1(3,2) a1(3,3)    0       0      0        0       0      0       
   0       0       0    a1(1,4) a1(1,5) a1(1,6)    0       0      0       
   0       0       0    a1(2,4) a1(2,5) a1(2,6)    0       0      0       
   0       0       0    a1(3,4) a1(3,5) a1(3,6)    0       0      0       
   0       0       0       0       0       0    a1(1,7) a1(1,8) a1(1,9)   
   0       0       0       0       0       0    a1(2,7) a1(2,8) a1(2,9)   
   0       0       0       0       0       0    a1(3,7) a1(3,8) a1(3,9)

and then left divide it by a2(:).

This can be done using kron and sparse matrix like this (source):

a1_kron = kron(speye(12000/3),ones(3));
a1_kron(logical(a1_kron)) = a1(:);
a = a1_kron\a2(:);
a = reshape(a, [3 12000/3]);

Advantage - Speed: This is about 3-4 times faster than a for loop with preallocation on my PC.

Disadvantage: There is one disadvantage you must consider with this approach: when using left division, Matlab looks for the best way to solve the systems of linear equations, so if you solve each sub-system independently, the best way will be chosen for each sub-system, but if you solve theme as one system, Matlab will find the best way for all the sub-systems together - not the best for each sub-system.

Note: As shown in Stefano M's answer, using one big system of equations (using kron and sparse matrix) is faster than using a for loop (with preallocation) only for very small size of sub-systems of equations (on my PC, for number of equation <= 7) for bigger sizes of sub-systems of equations, using a for loop is faster.

Comparing different methods

I wrote and ran a code to compare 4 different methods for solving this problems:

for loop, no preallocation
for loop, with preallocation
kron
cellfun

Test:

n = 1200000;
a1 = rand(3,n);
a2 = rand(3,n/3);

disp('Method 1: for loop, no preallocation')
tic
a_method1 = [];
for ii = 1:3:n
  a_method1 = [a_method1 a1(:,ii:ii+2)\a2(:, ceil(ii/3))];
end
toc
disp(' ')

disp('Method 2: for loop, with preallocation')
tic
a1_reshape = reshape(a1, 3, 3, []);
a_method2 = zeros(size(a2));
for i = 1:size(a1_reshape,3)
    a_method2(:,i) = a1_reshape(:,:,i) \ a2(:,i);
end
toc
disp(' ')

disp('Method 3: kron')
tic
a1_kron = kron(speye(n/3),ones(3));
a1_kron(logical(a1_kron)) = a1(:);
a_method3 = a1_kron\a2(:);
a_method3 = reshape(a_method3, [3 n/3]);
toc
disp(' ')

disp('Method 4: cellfun')
tic
a1_cells = mat2cell(a1, size(a1, 1), repmat(3 ,1,size(a1, 2)/3));
a2_cells = mat2cell(a2, size(a2, 1), ones(1,size(a2, 2)));
a_cells = cellfun(@(x, y) x\y, a1_cells, a2_cells, 'UniformOutput', 0);
a_method4 = cell2mat(a_cells);
toc
disp(' ')

Results:

Method 1: for loop, no preallocation
Elapsed time is 747.635280 seconds.
 
Method 2: for loop, with preallocation
Elapsed time is 1.426560 seconds.
 
Method 3: kron
Elapsed time is 0.357458 seconds.
 
Method 4: cellfun
Elapsed time is 3.390576 seconds.

Comparing the results of the four methods, you can see that using method 3 - kron gives slightly different results:

disp(['sumabs(a_method1(:) - a_method2(:)): ' num2str(sumabs(a_method1(:)-a_method2(:)))])
disp(['sumabs(a_method1(:) - a_method3(:)): ' num2str(sumabs(a_method1(:)-a_method3(:)))])
disp(['sumabs(a_method1(:) - a_method4(:)): ' num2str(sumabs(a_method1(:)-a_method4(:)))])

Result:

sumabs(a_method1(:) - a_method2(:)): 0
sumabs(a_method1(:) - a_method3(:)): 8.9793e-05
sumabs(a_method1(:) - a_method4(:)): 0

Upvotes: 4

Stefano M

Reputation: 4818

~~Marginal~~Biggest improvement would be to preallocate the output matrix, instead of growing it:

A1 = reshape(A1, 3, 3, []);
a = zeros(size(A2));
for i = 1:size(A1,3) 
    a(:,i) = A1(:,:,i) \ A2(:,i);
end

With the preallocate array, if the Parallel Toolbox is available, you can try parfor

Edit

This answer is no more relevant, since the OP rephrased the question to avoid growing the result array, which was the original major bottleneck.

The problem here is that one has to solve 4000 independent 3x3 linear systems. The matrix is so small that an ad hoc solution could be of interest, especially if one has some information on the matrix properties (symmetric, or not, condition number, etc.). However sticking to the \ matlab operator, the best way to speed up computations is by explicitly leverage of parallelism, e.g. by the parfor command.

The sparse matrix solution of the other answer by Eliahu Aaron is indeed very clever, but its speed advantage is not general but depends on the specific problem size.

With this function you can explore different problem sizes:

function [t2, t3] = sotest(n, n2)

a1 = rand(n,n*n2);
a2 = rand(n,n2);

tic
a1_reshape = reshape(a1, n, n, []);
a_method2 = zeros(size(a2));
for i = 1:size(a1_reshape,3)
    a_method2(:,i) = a1_reshape(:,:,i) \ a2(:,i);
end
t2 = toc;

tic
a1_kron = kron(speye(n2),ones(n));
a1_kron(logical(a1_kron)) = a1(:);
a_method3 = a1_kron\a2(:);
a_method3 = reshape(a_method3, [n n2]);
t3 = toc;

assert ( norm(a_method2 - a_method3, 1) / norm(a_method2, 1) < 1e-8)

Indeed for n=3 the sparse matrix method is clearly superior, but for increasing n it becomes less competitive

The above figure was obtained with

>> for i=1:20; [t(i,1), t(i,2)] = sotest(i, 50000); end
>> loglog(1:20, t, '*-')

My final comment is that an explicit loop with the dense \ operator is indeed fast; the sparse matrix formulation is slightly less accurate and could become problematic in edge cases; and for sure the sparse matrix solution is not very readable. If the number n2 of systems to solve is very very big (>1e6) then maybe ad hoc solutions should be explored.

Upvotes: 1

Cris Luengo

Reputation: 60680

You are solving a series of N systems with m linear equations each, the N systems are of the form

Ax = b

You can convert these to a single system of Nm linear equations:

|A1 0  0  ... 0 | |x1|   |b1|
|0  A2 0  ... 0 | |x2|   |b2|
|0  0  A3 ... 0 | |x3| = |b3|
|.  .  .  ... . | |. |   |. |
|0  0  0  ... AN| |xN|   |bN|

However, solving this one system of equations is a lot more expensive than solving all the little ones. Typically, the cost is O(n^3), so you go from O(N m^3) to O((Nm)^3). A huge pessimization. (Eliahu proved me wrong here, apparently the sparsity of the matrix can be exploited.)

Reducing the computational cost can be done, but you need to provide guarantees about the data. For example, if the matrices A are positive definite, the systems can be solved more cheaply. Nonetheless, given that you are dealing with 3x3 matrices, the winnings there will be slim, since those are pretty simple systems to solve.

If you are asking this because you think that loops are inherently slow in MATLAB, you should know that this is no longer the case, and hasn’t been the case since MATLAB gained a JIT 15 years or so ago. Today, many vectorization attempts lead to equally fast code, and oftentimes to slower code, especially for large data. (I could fish up some timings I’ve posted here on SO to prove this if necessary.)

I would think that solving all systems in one go could reduce the number of checks that MATLAB does every time the operator \ is called. That is, hard-coding the problem size and type might improve throughout. But the only way to do so is to write a MEX-file.

Upvotes: 1

Improving the performance of many sub-matrix left division operations (mldivide, \)

Answers (3)

Comparing different methods

Edit

Related Questions