Reputation: 992
I am curious as to why multiplying a sparse matrix by a dense matrix takes a different amount of time than the reverse. Are the algorithms significantly different?
Here's an example in MATLAB R2018a:
a=sprand(M,M,0.01);
b=rand(M);
tic;ref1=a*b;t_axb=toc
tic;ref2=b*a;t_bxa=toc
Here's an example with Eigen 3 and C++ using 1 thread:
//prepare acol=MxM ColMajor Eigen sparse matrix with 0.01 density
...
Map<Matrix<double,M,M,ColMajor> > bcol (PR, M, M );
double tic,toc;
tic=getHighResolutionTime();
result=acol*bcol;
toc=getHighResolutionTime();
printf("\nacol*bcol time: %f seconds", (toc - tic));
tic=getHighResolutionTime();
result=bcol*acol;
toc=getHighResolutionTime();
printf("\nbcol*acol time: %f seconds\n", (toc - tic));
When M=4000, the results are:
t_axb =
0.6877
t_bxa =
0.4803
acol*bcol time: 0.937590 seconds
bcol*acol time: 0.532622 seconds
When M=10000, the results are:
t_axb =
11.5649
t_bxa =
9.7872
acol*bcol time: 20.140380 seconds
bcol*acol time: 8.061626 seconds
In both cases, the sparse-dense product is slower than the dense-sparse product, for both MATLAB and Eigen. I am curious as to:
1. Why is this the case? Are the algorithms for sparse-dense significantly different from those for dense-sparse? The number of FLOPs is the same, right?
2. Why does Eigen match or exceed MATLAB's performance for the dense-sparse product but not for the sparse-dense product? A small difference in performance is normal, but a factor of ~1.4-1.8 seems strange given that both are highly optimized libraries. I am compiling Eigen with all the optimizations recommended in the documentation, i.e. -fPIC -fomit-frame-pointer -O3 -DNDEBUG -fopenmp -march=native
Upvotes: 5
Views: 1232
Reputation: 29205
You could observe the same difference by comparing column-major versus row-major storage for the sparse-matrix times vector product y = A * x. If A is row-major (equivalently, each coeff of y is an independent dot product between one row of A and x), then each row of A can be treated in parallel without any overhead (no communication, no additional temporary, no additional operation). In contrast, if A is column-major, multi-threading cannot come for free, and in most cases the overhead is larger than the gain.
Even without multi-threading, you see that the memory access patterns are very different:
- Row-major: multiple random read-only accesses to x, with each coeff of y being written only once.
- Column-major: each coeff of x is read once, but we get multiple random read-write accesses to the destination y.
So even without multi-threading, the situation is naturally favorable to row-major.
Upvotes: 6