Reputation: 84659
In a package of mine, I extract a submatrix B from a matrix A column by column. I noticed this was slow, and a benchmark confirms that it is indeed slower than the ordinary R way:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix extractColumns(NumericMatrix A, IntegerVector indices) {
  int n = indices.size();
  int m = A.nrow();
  NumericMatrix B(m, n);
  for(int i = 0; i < n; i++) {
    B(_, i) = A(_, indices(i));
  }
  return B;
}
//
/*** R
library(microbenchmark)
A <- matrix(rgamma(100000L, 5, 1), nrow = 2L, ncol = 50000L)
indices <- 2000L:40000L
microbenchmark(
  R = A[, indices],
  Rcpp = extractColumns(A, indices - 1L),
  times = 5L
)
*/
Why is it slow, and is there a faster way with Rcpp? (In general, the indices are not contiguous.)
I also tested with RcppEigen; it is slow as well:
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

// [[Rcpp::export]]
Eigen::MatrixXd extractColumns2(Eigen::MatrixXd A, Rcpp::IntegerVector indices) {
  int m = A.rows();
  int n = indices.size();
  Eigen::MatrixXd B(m, n);
  for(int i = 0; i < n; i++) {
    B.col(i) = A.col(indices(i));
  }
  return B;
}
Upvotes: 4
Views: 196
Reputation: 368499
This is an interesting question, and worth comparing a few approaches. I also think it is worth reiterating that we almost always have a tradeoff between clarity and performance. As such, I still like RcppArmadillo (even though the second measurement set below looks a little worrisome).
First off, the code. I added a function for RcppArmadillo which, given a vector of unsigned indices, can index directly; that makes it a one-liner, which clearly wins on clarity.
#include <RcppArmadillo.h>
#include <RcppEigen.h>

using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::NumericMatrix extractColumns(Rcpp::NumericMatrix A, Rcpp::IntegerVector indices) {
  int n = indices.size();
  int m = A.nrow();
  NumericMatrix B(m, n);
  for(int i = 0; i < n; i++) {
    B(_, i) = A(_, indices(i));
  }
  return B;
}

// [[Rcpp::export]]
NumericMatrix extractColumns1(NumericMatrix A, IntegerVector indices) {
  int n = indices.size();
  int m = A.nrow();
  NumericMatrix B(m, n);
  for(int i = 0; i < m; i++) {
    for(int j = 0; j < n; j++) {
      B(i, j) = A(i, indices(j));
    }
  }
  return B;
}

// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
Eigen::MatrixXd extractColumns2(Eigen::MatrixXd A, Rcpp::IntegerVector indices) {
  int m = A.rows();
  int n = indices.size();
  Eigen::MatrixXd B(m, n);
  for(int i = 0; i < n; i++) {
    B.col(i) = A.col(indices(i));
  }
  return B;
}

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat extractColumns3(arma::mat A, arma::uvec v) {
  return A.cols(v);
}
I made two changes. First, I added collapse besides (Rcpp)Armadillo, as it explicitly aims at performance (for the R user) with efficient and pure C code. Second, because the original matrix was a bit degenerate with two rows and 50k columns, I also create a somewhat more balanced 200-by-500 one.
/*** R
library(microbenchmark)
A <- matrix(rgamma(100000L, 5, 1), nrow = 2L, ncol = 50000L)
indices <- 2000L:40000L
microbenchmark(
  R = A[, indices],
  RcppBasic = extractColumns(A, indices - 1L),
  RcppImprv = extractColumns1(A, indices - 1L),
  RcppEigen = extractColumns2(A, indices - 1L),
  RcppArma = extractColumns3(A, indices - 1L),
  collapse = collapse::ss(A, 1:2, indices),
  times = 25L
)
A <- matrix(rgamma(100000L, 5, 1), nrow = 200L, ncol = 500L)
indices <- 200L:400L
microbenchmark(
  R = A[, indices],
  RcppBasic = extractColumns(A, indices - 1L),
  RcppImprv = extractColumns1(A, indices - 1L),
  RcppEigen = extractColumns2(A, indices - 1L),
  RcppArma = extractColumns3(A, indices - 1L),
  collapse = collapse::ss(A, 1:200, indices),
  times = 25L
)
all.equal(A[, indices], extractColumns(A, indices - 1L))
all.equal(A[, indices], extractColumns1(A, indices - 1L))
all.equal(A[, indices], extractColumns2(A, indices - 1L))
all.equal(A[, indices], extractColumns3(A, indices - 1L))
all.equal(A[, indices], collapse::ss(A, 1:200, indices))
*/
> Rcpp::sourceCpp("~/git/stackoverflow/76538309/question.cpp")
> library(microbenchmark)
> A <- matrix(rgamma(100000L, 5, 1), nrow = 2L, ncol = 50000L)
> indices <- 2000L:40000L
> microbenchmark(
+   R = A[, indices],
+   RcppBasic = extractColumns(A, indices - 1L),
+   RcppImprv = extractColumns1(A, indices - 1L),
+   .... [TRUNCATED]
Unit: microseconds
      expr      min       lq     mean   median       uq      max neval cld
         R  198.698  201.356  601.502  247.764  406.431 4636.895    25 ab
 RcppBasic 1260.979 1337.403 1742.815 1522.285 1789.843 4884.915    25   c
 RcppImprv  310.913  339.943  623.045  533.361  587.319 2480.557    25 ab
 RcppEigen  293.941  502.759  922.591  772.874 1024.857 4161.447    25 a
  RcppArma  299.430  330.409  811.221  786.544 1072.527 1997.107    25 ab
  collapse  201.500  210.785  316.888  327.262  409.214  465.755    25  b
> A <- matrix(rgamma(100000L, 5, 1), nrow = 200L, ncol = 500L)
> indices <- 200L:400L
> microbenchmark(
+   R = A[, indices],
+   RcppBasic = extractColumns(A, indices - 1L),
+   RcppImprv = extractColumns1(A, indices - 1L),
+   .... [TRUNCATED]
Unit: microseconds
      expr     min      lq     mean  median      uq      max neval cld
         R  63.524  64.036  80.8516  65.433  72.149  174.069    25 a
 RcppBasic  23.706  27.381  70.9232  31.737 135.762  144.999    25 a
 RcppImprv 132.711 134.079 151.1988 136.953 139.232  250.357    25 a
 RcppEigen  51.693  75.133 172.8276 121.080 163.501  434.985    25 a
  RcppArma  79.163  95.515 290.6080 187.556 515.465  618.522    25 a
  collapse  65.720  66.837 240.1691  68.680  95.058 3777.815    25 a
> all.equal(A[, indices], extractColumns(A, indices - 1L))
[1] TRUE
> all.equal(A[, indices], extractColumns1(A, indices - 1L))
[1] TRUE
> all.equal(A[, indices], extractColumns2(A, indices - 1L))
[1] TRUE
> all.equal(A[, indices], extractColumns3(A, indices - 1L))
[1] TRUE
> all.equal(A[, indices], collapse::ss(A, 1:200, indices))
[1] TRUE
>
Lots of things matter here. The 'improved Rcpp' function by @Mikko looks good on the original, oddly shaped matrix, but it underperforms vis-à-vis the Rcpp-sugar-enhanced one on the more balanced matrix: sugar can give you parallel loop unrolling, which starts to matter once there is real looping to be done per column (and hurts when there is not, as on the two-row case). The collapse function looks good: it beats base R on the "odd" matrix and is at par on the normal one. Eigen and Armadillo do their thing; on the "odd" matrix they suffer (and using them forces casts that have a small cost). The bad performance of RcppArmadillo on the second example is a bit of a head-scratcher.
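As an aside on the cast cost just mentioned: RcppEigen can also take the input as an Eigen::Map, which wraps the memory of the R matrix instead of copying it on entry. A minimal sketch (the name extractColumnsMap is hypothetical and was not part of the benchmark above):
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

// Sketch only, not benchmarked above. Eigen::Map is a view onto the
// memory of the R matrix, so taking A as a Map avoids copying all of A
// on the way in; only the selected columns are copied into the result.
// [[Rcpp::export]]
Eigen::MatrixXd extractColumnsMap(const Eigen::Map<Eigen::MatrixXd> A,
                                  const Rcpp::IntegerVector indices) {
  const int m = A.rows();
  const int n = indices.size();
  Eigen::MatrixXd B(m, n);
  for(int i = 0; i < n; i++) {
    B.col(i) = A.col(indices[i]);  // 0-based indices, as elsewhere here
  }
  return B;
}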
So in sum: Details matter. The setup matters. Nothing really dominates all use cases.
Upvotes: 4
Reputation: 11908
It seems there is at least some overhead associated with the _ sugar. I noticed that writing out the inner loop explicitly makes things a bit faster, though still a bit slower than the native R method:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix extractColumns(NumericMatrix A, IntegerVector indices) {
  int n = indices.size();
  int m = A.nrow();
  NumericMatrix B(m, n);
  for(int i = 0; i < m; i++) {
    for(int j = 0; j < n; j++) {
      B(i, j) = A(i, indices(j));
    }
  }
  return B;
}
//
/*** R
library(microbenchmark)
A <- matrix(rgamma(100000L, 5, 1), nrow = 2L, ncol = 50000L)
indices <- 2000L:40000L
microbenchmark(
  R = A[, indices],
  Rcpp = extractColumns(A, indices - 1L),
  times = 5L
)
*/
> library(microbenchmark)
> A <- matrix(rgamma(100000L, 5, 1), nrow = 2L, ncol = 50000L)
> indices <- 2000L:40000L
> microbenchmark(
+   R = A[, indices],
+   Rcpp = extractColumns(A, indices - 1L),
+   times = 5L
+ )
Unit: microseconds
 expr   min    lq   mean median    uq    max neval cld
    R 325.5 367.2 393.72  382.3 422.5  471.1     5   a
 Rcpp 366.6 370.9 788.70  431.1 597.8 2177.1     5   a
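To close more of that remaining gap, one further option (a sketch, not from either answer above; extractColumnsCopy is a hypothetical name) is to copy each selected column as one contiguous block, since a NumericMatrix is stored column-major and each column occupies m consecutive doubles:
#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// Hypothetical variant, not benchmarked above. Column j of A starts at
// A.begin() + j * m and spans m contiguous doubles, so std::copy can
// move the whole column in one call instead of element by element.
// [[Rcpp::export]]
NumericMatrix extractColumnsCopy(NumericMatrix A, IntegerVector indices) {
  int n = indices.size();
  int m = A.nrow();
  NumericMatrix B(m, n);
  for(int i = 0; i < n; i++) {
    const double* src = A.begin() + (std::size_t)m * indices[i];
    std::copy(src, src + m, B.begin() + (std::size_t)m * i);
  }
  return B;
}
Whether this actually beats the explicit double loop will depend on the matrix shape, as the benchmarks above suggest.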
Upvotes: 2