Reputation: 1002
Edit: See this question where I learned how to parallelize sparse matrix-vector multiplication in Python using Numba, and was able to tie with Matlab.
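Below is a minimal sketch of the kind of Numba-parallelized CSR kernel that the linked question arrives at; the function name, density, and sizes here are illustrative rather than copied from that answer.
import numpy as np
import scipy.sparse as sparse
from numba import njit, prange

@njit(parallel=True)
def csr_matvec_numba(n_row, indptr, indices, data, x, out):
    # Each row's dot product is independent, so prange can split rows across cores.
    for i in prange(n_row):
        s = 0.0
        for jj in range(indptr[i], indptr[i + 1]):
            s += data[jj] * x[indices[jj]]
        out[i] = s

# Illustrative sizes matching the matrix described below.
A = sparse.random(47166, 113954, density=0.0071, format='csr')
x = np.random.rand(113954)
y = np.empty(A.shape[0])
csr_matvec_numba(A.shape[0], A.indptr, A.indices, A.data, x, y)
assert np.allclose(y, A.dot(x))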
Original question:
I'm observing that sparse matrix-vector multiplication is about 4 or 5 times faster in Matlab than in Python (using scipy sparse matrices). Here are some details from the Matlab command line:
>> whos A
  Name          Size                    Bytes  Class     Attributes

  A         47166x113954            610732376  double    sparse

>> whos ATrans
  Name            Size                    Bytes  Class     Attributes

  ATrans      113954x47166            610198072  double    sparse
>> nnz(A)/numel(A)
ans =
0.0071
>> whos x
  Name         Size               Bytes  Class     Attributes

  x       113954x1               911632  double
>> myFun = @() A*x; timeit(myFun)
ans =
0.0601
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.0120
The matrix ATrans is the transpose of A. Notice that computing A times x in Matlab takes about .06 seconds, but if I use the weird "transpose trick" and compute ATrans' times x (which yields the same result as A times x), the computation takes .012 seconds. (I do not understand why this trick works.)
Here are some timing results from the Python command line:
In[50]: type(A)
Out[50]: scipy.sparse.csc.csc_matrix
In[51]: A.shape
Out[51]: (47166, 113954)
In[52]: type(x)
Out[52]: numpy.ndarray
In[53]: x.shape
Out[53]: (113954L, 1L)
In[54]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
Out[54]: 0.054835451461831324
So the Python runtime to do A times x is about 4.5 times longer than the Matlab runtime, for the same matrix A. If I store A in csr format, the Python runtime is a little bit worse:
In[63]: A = sp.sparse.csr_matrix(A)
In[64]: timeit.timeit('A.dot(x)',setup="from __main__ import A,x",number=100)/100.0
Out[64]: 0.0722580226496575
Here's some info about which Python version and which Anaconda version I'm using:
In[2]: import sys; print('Python %s on %s' % (sys.version, sys.platform))
Python 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)] on win32
Question: Why is this sparse matrix-vector multiplication faster in Matlab than in Python? And how can I make it equally fast in Python?
Edit 1: Here is a clue. In Python, if I set the number of threads to 1, the runtime for dense matrix-vector multiplication is impacted severely, but the runtime for sparse matrix-vector multiplication is nearly unchanged.
In[48]: M = np.random.rand(1000,1000)
In[49]: y = np.random.rand(1000,1)
In[50]: import mkl
In[51]: mkl.get_max_threads()
Out[51]: 20
In[52]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[52]: 7.232593519574948e-05
In[53]: mkl.set_num_threads(1)
In[54]: timeit.timeit('M.dot(y)', setup = "from __main__ import M,y", number=100) / 100.0
Out[54]: 0.00044465965093536396
In[56]: type(A)
Out[56]: scipy.sparse.csc.csc_matrix
In[57]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[57]: 0.055780856886028685
In[58]: mkl.set_num_threads(20)
In[59]: timeit.timeit('A.dot(x)', setup = "from __main__ import A,x", number=100) / 100.0
Out[59]: 0.05550840215802509
So for dense matrix-vector products, setting the number of threads to 1 made the runtime about 6 times longer. But for the sparse matrix-vector product, reducing the number of threads to 1 did not change the runtime.
I think this suggests that in Python the sparse matrix-vector multiplication is not being performed in parallel, whereas dense matrix-vector multiplication is making use of all available cores. Do you agree with this conclusion? And if so, is there a way to make use of all available cores for sparse matrix-vector multiplication in Python?
Note that Anaconda is supposed to use Intel MKL optimizations by default: https://www.continuum.io/blog/developer-blog/anaconda-25-release-now-mkl-optimizations
Edit 2: I read here that "For sparse matrices, all level 2 [BLAS] operations except for the sparse triangular solvers are threaded" in the Intel MKL. This suggests to me that scipy is not using Intel MKL to perform sparse matrix-vector multiplication. It seems that @hpaulj (in an answer posted below) has confirmed this conclusion by inspecting the code for the function csr_matvec. So, can I just call the Intel MKL sparse matrix-vector multiplication function directly? How would I do that?
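Something like the following untested ctypes sketch is what I have in mind (assumptions: the mkl_rt runtime can be loaded by name, and the matrix is square with int32 indices and float64 data, since mkl_cspblas_dcsrgemv appears to be documented for square matrices only; a rectangular matrix like my A would presumably need the more general mkl_dcsrmv instead):
import ctypes
import numpy as np
import scipy.sparse as sparse

mkl = ctypes.CDLL('mkl_rt.dll')  # 'libmkl_rt.so' on Linux

def mkl_csr_matvec(A, x):
    # Assumes A is square CSR with int32 indices / float64 data and x is a
    # contiguous float64 vector; computes y = A*x using zero-based indexing.
    y = np.empty(A.shape[0], dtype=np.float64)
    transa = ctypes.c_char(b'N')   # 'N' = no transpose
    m = ctypes.c_int(A.shape[0])
    mkl.mkl_cspblas_dcsrgemv(
        ctypes.byref(transa), ctypes.byref(m),
        A.data.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
        A.indptr.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
        A.indices.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
        x.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
        y.ctypes.data_as(ctypes.POINTER(ctypes.c_double)))
    return y

# Square test case, since this routine does not appear to handle rectangular A.
A = sparse.random(10000, 10000, density=0.007, format='csr')
x = np.random.rand(10000)
print(np.allclose(mkl_csr_matvec(A, x), A.dot(x)))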
Edit 3: Here is an additional piece of evidence. The Matlab sparse matrix-vector multiplication runtime appears to be unchanged when I set the maximum number of threads equal to 1.
>> maxNumCompThreads
ans =
20
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.012545604076342
>> maxNumCompThreads(1) % set number of threads to 1
ans =
20
>> maxNumCompThreads % Check that max number of threads is 1
ans =
1
>> myFun = @() ATrans'*x; timeit(myFun)
ans =
0.012164191957568
This makes me question my previous theory that Matlab's advantage is due to multithreading.
Upvotes: 3
Views: 2011
Reputation: 231325
We get variations in times depending on the mix of sparse and dense:
In [40]: A = sparse.random(1000,1000,.1, format='csr')
In [41]: x = np.random.random((1000,1))
In [42]: Ad = A.A
In [43]: xs = sparse.csr_matrix(x)
Sparse with dense produces dense, but sparse with sparse produces sparse:
In [47]: A.dot(xs)
Out[47]:
<1000x1 sparse matrix of type '<class 'numpy.float64'>'
with 1000 stored elements in Compressed Sparse Row format>
In [48]: np.allclose(A.dot(x), Ad.dot(x))
Out[48]: True
In [49]: np.allclose(A.dot(x), A.dot(xs).A)
Out[49]: True
Compared to the alternatives, sparse with dense looks pretty good:
In [50]: timeit A.dot(x) # sparse with dense
137 µs ± 269 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [51]: timeit Ad.dot(x) # dense with dense
1.03 ms ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [52]: timeit A.dot(xs) # sparse with sparse
1.44 ms ± 644 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
With much bigger matrices the relative times could well be different; the same goes for a different sparsity.
I don't have access to MATLAB so can't make an equivalent test, though I could try this on Octave.
This is on a basic Linux (ubuntu) computer with up to date numpy and scipy.
My previous exploration, Directly use Intel mkl library on Scipy sparse matrix to calculate A dot A.T with less memory, dealt with sparse-matrix-times-sparse-matrix calculations. Sparse with dense must use different code. We'd have to trace it to see whether it delegates anything special to the underlying math libraries.
csc format is slightly slower for this calculation - probably because the basic iteration is across rows.
In [80]: Ac = A.tocsc()
In [81]: timeit Ac.dot(x)
216 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
A.dot(x) can be traced through to this call to compiled code:
In [70]: result = np.zeros((1000,1))
In [71]: sparse._sparsetools.csr_matvec(1000,1000,A.indptr, A.indices, A.data, x, result)
_sparsetools is a compiled extension module (a .so file), built from C++ template sources rather than Cython (.pyx) code.
In sparsetools/csr.h the code (template) for matvec is:
void csr_matvec(const I n_row,
                const I n_col,
                const I Ap[],
                const I Aj[],
                const T Ax[],
                const T Xx[],
                      T Yx[])
{
    for(I i = 0; i < n_row; i++){
        T sum = Yx[i];
        for(I jj = Ap[i]; jj < Ap[i+1]; jj++){
            sum += Ax[jj] * Xx[Aj[jj]];
        }
        Yx[i] = sum;
    }
}
That looks like straightforward C++ iteration over the csr attributes (indptr, indices), multiplying and summing. There's no attempt to make use of optimized math libraries or parallel cores.
https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h
Upvotes: 4
Reputation: 4953
Well, it depends on what you compare.
Basically, MATLAB uses Intel MKL for its linear algebra calculations (at least for most of them).
In Python, the back end for linear algebra calculations depends on the package you use (NumPy, for instance) and the distribution (Anaconda, for instance).
If you use Anaconda or Intel's distribution, NumPy uses Intel MKL, which means you should expect similar performance.
In other cases, it really depends on what you have.
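A quick way to check which backend a given NumPy build is linked against (the exact output format varies between NumPy versions and distributions):
import numpy as np
np.show_config()  # MKL builds list mkl-related libraries in the BLAS/LAPACK sections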
Upvotes: 0