Reputation: 193
I observed that matrix-by-vector multiplication with a large matrix is significantly slower in R (version 3.6.1) than in Matlab (version 2019b), although both languages rely on the (same?) BLAS library. See the minimal example below:
In Matlab:
n=900;
p=900;
A=reshape(1:(n*p),[n,p]);
x=ones(p,1);
tic()
for id = 1:1000
    x = A*x;
end
toc()
In R:
n=900
p=900
A=matrix(c(1:(n*p)),nrow=n,ncol=p)
x=rep(1,ncol(A))
t0 <- Sys.time()
for(iter in 1:1000){
  x = A%*%x
}
t1 <- Sys.time()
print(t1-t0)
I get an execution time of roughly 0.05s in Matlab versus 3.5s in R on the same computer. Any idea of the reason for such a difference?
Thanks.
[EDIT]: I add below a similar computation in C (using the CBLAS library; compiled with gcc cblas_dgemv.c -lblas -o cblas_dgemv, where cblas_dgemv.c denotes the source file below). I get a running time of roughly 0.08s, which is quite close to the running time obtained with Matlab (0.05s). I am still trying to figure out the reason for the huge running time in R (3.5s).
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include "cblas.h"
int main(int argc, char **argv)
{
    int m = 900, adr;
    double *A, *x, *y;
    struct timeval t0, t1;

    /* memory allocation and initialization */
    A = (double*)malloc(m*m*sizeof(double));
    x = (double*)malloc(m*sizeof(double));
    y = (double*)malloc(m*sizeof(double));
    for(adr = 0; adr < m; adr++) x[adr] = 1.;
    for(adr = 0; adr < m*m; adr++) A[adr] = adr;

    /* main loop */
    gettimeofday(&t0, NULL);
    for(adr = 0; adr < 1000; adr++)
        cblas_dgemv(CblasColMajor, CblasNoTrans, m, m, 1., A, m, x, 1, 0., y, 1);
    gettimeofday(&t1, NULL);
    printf("elapsed time = %.2e seconds\n", (double)(t1.tv_usec-t0.tv_usec)/1000000. + (double)(t1.tv_sec-t0.tv_sec));

    /* free memory */
    free(A);
    free(x);
    free(y);
    return EXIT_SUCCESS;
}
Notice that I was not able to set y=x in the cblas_dgemv routine. Thus, this C computation is slightly different from the one done in the R and Matlab codes above. However, the compilation was done without optimization flags (no -O3 option) and I checked that the matrix-vector product was indeed computed at each iteration of the loop (performing 10x more iterations leads to a 10x longer running time).
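Since dgemv forbids aliasing its input and output vectors, one way to emulate the x = A*x update of the R and Matlab codes is to alternate two buffers and swap pointers after each product. Below is a minimal sketch of that double-buffering idea; a naive column-major mat-vec stands in for cblas_dgemv so it compiles without linking BLAS, and the name matvec_iterate is invented here for illustration:

```c
#include <stddef.h>

/* y = A*x for a column-major m-by-m matrix; stand-in for cblas_dgemv */
static void matvec(size_t m, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) y[i] = 0.;
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < m; i++)
            y[i] += A[j*m + i] * x[j];
}

/* Apply A niter times to the vector initially stored in buf0.
   Returns a pointer to whichever buffer holds the final result. */
double *matvec_iterate(size_t m, const double *A,
                       double *buf0, double *buf1, int niter)
{
    double *x = buf0, *y = buf1, *tmp;
    for (int it = 0; it < niter; it++) {
        matvec(m, A, x, y);      /* real code: cblas_dgemv(..., x, 1, 0., y, 1) */
        tmp = x; x = y; y = tmp; /* output becomes the next input */
    }
    return x;                    /* buffer written by the last product */
}
```

This keeps the loop allocation-free, matching the semantics of the R/Matlab benchmarks without violating the no-aliasing rule of dgemv.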
Upvotes: 7
Views: 466
Reputation: 193
I used Rcpp to interface a C++ implementation of the matrix-vector product with R.
Source C++ file ('cblas_dgemv2.cpp'): performs the product y = A*x 'niter' times.
#include <cblas-openblas.h>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
Rcpp::NumericVector cblas_dgemv2(Rcpp::NumericMatrix A, Rcpp::NumericVector x, int niter)
{
    int m = A.ncol(), iter;
    Rcpp::NumericVector y(m);
    for (iter = 0; iter < niter; iter++)
        cblas_dgemv(CblasColMajor, CblasNoTrans, m, m, 1., A.begin(), m, x.begin(), 1, 0., y.begin(), 1);
    return y;
}
Then I perform two experiments using the R code below:
Experiment 1: a single call y=cblas_dgemv2(A,x,1000) performs the product y = A*x 1000 times within C++;
Experiment 2: 1000 calls to y=cblas_dgemv2(A,x,1), each performing the product y = A*x once in C++.
# compile cblas_dgemv2 (you may need to update the path to the library)
PKG_LIBS <- '/usr/lib/x86_64-linux-gnu/libopenblas.a'
PKG_CPPFLAGS <- '-I/usr/include/x86_64-linux-gnu'
Sys.setenv(PKG_LIBS = PKG_LIBS , PKG_CPPFLAGS = PKG_CPPFLAGS)
Rcpp::sourceCpp('cblas_dgemv2.cpp', rebuild = TRUE)
# create A and x
n=900
A=matrix(c(1:(n*n)),nrow=n,ncol=n)
x=rep(1,n)
# Experiment 1: call 1 time cblas_dgemv2 (with 1000 iterations in C++)
t0 <- Sys.time()
y=cblas_dgemv2(A,x,1000) # perform 1000 times the computation y = A%*%x
t1 <- Sys.time()
print(t1-t0)
# Experiment 2: call 1000 times cblas_dgemv2
t0 <- Sys.time()
for(iter in 1:1000){
  y=cblas_dgemv2(A,x,1) # perform the computation y = A%*%x once
}
t1 <- Sys.time()
print(t1-t0)
I get a running time of 0.08s for the first experiment versus 4.8s for the second.
My conclusion: the bottleneck in terms of running time comes from the data transfers between R and C++, not from the computation of the matrix-vector product itself. Surprising, isn't it?
Upvotes: 1
Reputation: 23908
Here's something kind of shocking:
The precompiled R distribution that is downloaded from CRAN makes use of the reference BLAS/LAPACK implementation for linear algebra operations
The "reference BLAS" is the non-optimized, non-accelerated BLAS, unlike OpenBLAS or Intel MKL. Matlab uses MKL, which is accelerated.
This seems to be confirmed by sessionInfo() in my R 3.6.0 on macOS:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
If I'm reading this right, that means that by default, R uses a slow BLAS, and if you want it to go fast, you need to do some configuration to get it to use a fast BLAS instead.
This is a little surprising to me. As I understand it, the reference BLAS is generally primarily used for testing and development, not for "actual work".
I get about the same timings as you in R 3.6 vs Matlab R2019b on macOS 10.14: 0.04 seconds in Matlab, 4.5 seconds in R. I think that's consistent with R using a non-accelerated BLAS.
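If you want to try a fast BLAS yourself, one possible route on Debian/Ubuntu is to install OpenBLAS and switch the system-wide BLAS/LAPACK that R links against via the alternatives system. This is a sketch only: package and alternative names vary by distribution release (some use a -x86_64-linux-gnu suffix), and on macOS you would instead replace the libRblas dylib.

```shell
# Install OpenBLAS and select it as the default BLAS/LAPACK (Debian/Ubuntu).
sudo apt-get install libopenblas-dev
sudo update-alternatives --config libblas.so.3
sudo update-alternatives --config liblapack.so.3
# Restart R and confirm with sessionInfo() that the BLAS line now
# points at an OpenBLAS library instead of libRblas.
```

After the switch, rerunning the benchmark above should show matrix products closer to the Matlab/CBLAS timings.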
Upvotes: 10