Reputation: 1146
I'm trying to push the maximum out of a Ryzen 9 3950X 16-core machine on Ubuntu 20.04, with Microsoft R 3.5.2 and Intel MKL, and the Rcpp code is run with the Sys.setenv(MKL_DEBUG_CPU_TYPE = 5) workaround set beforehand.
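For reference, the same workaround can be applied from the shell before R is ever started, which guarantees MKL sees it at load time. A minimal sketch (the R invocation at the end is a hypothetical example):

```shell
# Export the workaround so MKL's runtime dispatch picks the fast
# AVX2 code path on this (non-Intel) Zen 2 CPU.
export MKL_DEBUG_CPU_TYPE=5

# R would then be launched from this same shell, e.g.:
#   R --vanilla
echo "MKL_DEBUG_CPU_TYPE=$MKL_DEBUG_CPU_TYPE"
```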
The following are the main operations I'd like to optimize for:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;

// Draw n samples from N(mu, sigma): Y holds i.i.d. standard normals,
// Y * chol(sigma) (upper-triangular R with sigma = R'R) gives them the
// target covariance, and the transposed repmat adds mu to every row.
// [[Rcpp::export]]
arma::mat mvrnormArma(int n, arma::vec mu, arma::mat sigma) {
  int ncols = sigma.n_cols;
  arma::mat Y = arma::randn(n, ncols);
  return arma::repmat(mu, 1, n).t() + Y * arma::chol(sigma);
}
Fast SVD (I found that base::svd performs better than any Rcpp realization I've found so far, including arma::svd("dc"), probably due to different U, S, V dimensions).
Fast matrix multiplications for various results (I found code written in C, rewrote all of it in base R, and am seeing vast improvements from multicore vs. the previous single-core performance. Can base R matrix operations be improved further?)
I've tried various setups with R 4.0.2 and OpenBLAS (through the ropenblas package), played with various Intel MKL releases, and read up on AMD's BLIS and libflame (which I don't even know how to test with R).
Overall, this setup outperforms a laptop with an i7-8750H and Microsoft R 3.5.1 (with working MKL) by around 2x. Given 16 vs. 6 cores (and faster RAM), I was expecting at least a 3-3.5x improvement (based, e.g., on Cinebench and similar performance benchmarks).
How can this setup be further improved?
My main issues/questions:
First, I've noticed that the current setup, when run with 1 worker, uses around 1000-1200% CPU according to top. Through experimentation, I've found that spawning two parallel workers uses most of the CPU, around 85-95%, and delivers the best performance. For example, 3 workers use the whole 100% but bottleneck somewhere, drastically reducing performance for some reason.
I'm guessing this is a limitation coming either from R/MKL or from how the Rcpp code is compiled, since 10-12 cores seems oddly specific. Can this be improved with some hints when compiling the Rcpp code?
Secondly, I'm sure I'm not using the optimal BLAS/LAPACK libraries for the job. My guess is that a properly compiled R 4.0.2 should be significantly faster than Microsoft R 3.5.2, but I have absolutely no idea what I'm missing, whether AVX/AVX2 instructions are actually being used, and what else I should try on this machine.
Lastly, I have seen zero guides on calling/working with AMD BLIS/libflame from R. If this is trivial, I would appreciate any hints on what to look into.
Upvotes: 2
Views: 1895
Reputation: 1146
Until any other (hopefully much better) answer pops up, I'll post my latest findings from guesswork here. Hopefully someone with a similar machine will find this useful. I'll try to expand the answer if any additional improvements come up.
Guide for clean R compilation. Seems outdated, but hopefully nothing important is missing:
Speed up RcppArmadillo: How to link to OpenBlas in an R package
OpenBLAS performs terribly on my Ryzen + Ubuntu configuration: version 0.3.10, compiled with zen2 hints, uses all the CPU cores, but badly. top reports 3200% usage for the R instance, yet total CPU utilisation never rises above 20-30%. The result is at least 3x slower than with Intel MKL.
Intel MKL. Versions up to 2019 work with the MKL_DEBUG_CPU_TYPE workaround; I can confirm that intel-mkl-64bit-2019.5-075 works.
For later versions, starting with 2020.0-088, a different workaround is needed. In my benchmarks the performance did not improve, but this may change with future MKL releases.
The hardcoded 10-12 thread cap per instance appears to be controlled by several environment variables. I found the following list in an old guide. These may well change in later versions, but they work with 2019.5-075:
export MKL_NUM_THREADS=2
export OMP_NESTED="TRUE"
export MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
export OMP_NUM_THREADS=1
export MKL_DYNAMIC="TRUE"
export OMP_DYNAMIC="FALSE"
Playing around with various configurations, I found that, for the specific benchmark I tested on, reducing the number of threads and spawning more workers increased performance drastically (around 3-4 fold). Even though the reported CPU usage was similar across multicore variants of the configuration, 2 workers using 16 threads each (totaling ~70% CPU utilisation) were much slower than 16 workers using 2 threads each (at similar CPU utilisation). Results may vary with different tasks, so these seem to be the go-to parameters to tune for every longer task.
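The trade-off above can be sketched as a simple sweep: keep the total number of logical CPUs fixed and vary the split between MKL threads per worker and the number of parallel R workers. Everything here is a hypothetical illustration (the 3950X has 32 logical CPUs; worker.R is a made-up script name):

```shell
# Sweep the threads-per-worker / worker-count trade-off on a
# 16-core / 32-thread machine: the product stays constant at 32.
LOGICAL_CPUS=32
for THREADS in 1 2 4 8 16; do
  WORKERS=$((LOGICAL_CPUS / THREADS))
  echo "MKL_NUM_THREADS=$THREADS -> launch $WORKERS R workers"
  # Each worker would then be started as, e.g.:
  #   MKL_NUM_THREADS=$THREADS Rscript worker.R &
done
```

In my case the extreme "many workers, few threads" end of this sweep (16 workers x 2 threads) won clearly over the "few workers, many threads" end, despite similar apparent CPU utilisation.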
Checking with perf, for my benchmarks the calls made were bli_dgemmsup_rd_haswell_asm_6x8m, bli_daxpyv_zen_int10, and others. I'm not yet sure whether the settings used for compiling BLIS were optimal. The takeaway could be that both MKL and BLIS are actually pushing the maximum from the CPU, given my specific benchmarks... or at least that both libraries are similarly optimized.
An important downside of sticking with AMD BLIS: I noticed this only after months of usage, but there seem to be some unresolved issues with BLIS, or with the LAPACK packed into the AMD libraries, that I can't pin down. I've hit random matrix multiplication issues that are not reproducible (essentially, running into this problem) and that are solved by switching back to the MKL libs. I can't say whether the problem is in my way of building R or in the actual libraries, so consider this a warning.
Upvotes: 1