Reputation: 6532
I've been testing Rcpp and RcppArmadillo for calculating summary stats on big matrices. This was a lot faster (5 or 10 times faster) than the base R colMeans or the the Armadillo on ~4million rows, 45 columns.
colMeansRcpp <- cxxfunction(signature(X_="integer"),
plugin='Rcpp',
body='
Rcpp::IntegerMatrix X = X_;
int ncol = X.ncol(); int nrow = X.nrow();
Rcpp::NumericVector out(ncol);
for(int col = 0; col < ncol; col++){
out[col]=Rcpp::sum(X(_, col));
}
return wrap(out/nrow);
')
I really want to calculate the median and maybe other quantiles for plotting - and because it requires a sort its even more needy of C++ outsourcing. The armadillo seems a bit slow so I wanted to do an in place sort on code similar to above but I just cant get the syntax right... here is what I'm trying..
# OK I'm aware this floor(nrow/2) is not **absolutely** correct
# I'm simplifying here
colMedianRcpp <- cxxfunction(signature(X_="integer"),
plugin='Rcpp',
body='
Rcpp::IntegerMatrix X = clone(X_);
int ncol = X.ncol(); int nrow = X.nrow();
Rcpp::NumericVector out(ncol);
for(int col = 0; col < ncol; col++){
X(_,col)= std::sort((X_,col).begin, (X_,col).end));
out[col]=X(floor(nrow/2), col));
}
return wrap(out);
')
Basically it's the line
X(_,col)= std::sort((X_,col).begin, (X_,col).end));
I don't know how to express "sort a column in place" with this mixture of Rcpp sugar and std C++. Sorry I can see what I'm doing is wrong but a hint on the right syntax would be lovely.
ps Am I right I need to do this clone() so I don't change the R object?
EDIT I add the RcppArmadillo code and a benchmark comparison to address the answer/comment below. the benchmark was only on 50k rows for a quick reply but I recall it was similar with many more. I realise you are the Rcpp author.. so many thanks for your time!
The thought occurs that perhaps I'm doing something daft with the RcppArmadillo code to make it run so much slower than the base colMeans or Rcpp version?
colMeansRcppArmadillo <- cxxfunction(signature(X_="integer"),
plugin="RcppArmadillo",
body='
arma::mat X = Rcpp::as<arma::mat > (X_);
arma::rowvec MD= arma::mean(X, 0);
return wrap(MD);
')
And the benchmark is ...
(mb = microbenchmark(
+ colMeans(fqSmallMatrix),
+ colMeansRcpp(fqSmallMatrix),
+ colMeansRcppArmadillo(fqSmallMatrix),
+ times=50))
Unit: milliseconds
expr min lq median uq max neval
colMeans(fqSmallMatrix) 10.620919 10.63289 10.640819 10.648882 10.907145 50
colMeansRcpp(fqSmallMatrix) 2.649038 2.66832 2.676709 2.700839 2.841012 50
colMeansRcppArmadillo(fqSmallMatrix) 25.687067 26.23488 33.168589 33.792489 113.832495 50
Upvotes: 2
Views: 2586
Reputation: 368399
You are not actually showing RcppArmadillo code -- I have been quite happy with the performance of RcppArmadillo code where I needed row/col column subsetting.
You can instantiate Armadillo matrices via Rcpp just about as efficiently (no copy, re-using R object memory) so I would try that.
And you: you want clone()
for a distinct copy, and I think you'd get that for free if you use the default RcppArmadillo ctor (rather than the more efficient two-step).
Edit a few hours later
You had left an open question about why your Armadillo was slow. In the meantime, Vincent solved the issue for you but here is a revisited, cleaner solution using your code as well as Vincent's.
Now how it instantiates the Armadillo matrix without copy -- so it is faster. And it also avoids mixing integer and numeric matrices. The code first:
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
NumericVector colMedianRcpp(NumericMatrix x) {
int nrow = x.nrow();
int ncol = x.ncol();
int position = nrow / 2; // Euclidian division
NumericVector out(ncol);
for (int j = 0; j < ncol; j++) {
NumericVector y = x(_,j); // Copy column -- original will not be mod
std::nth_element(y.begin(), y.begin() + position, y.end());
out[j] = y[position];
}
return out;
}
// [[Rcpp::export]]
arma::rowvec colMeansRcppArmadillo(NumericMatrix x){
arma::mat X = arma::mat(x.begin(), x.nrow(), x.ncol(), false);
return arma::mean(X, 0);
}
// [[Rcpp::export]]
NumericVector colMeansRcpp(NumericMatrix X) {
int ncol = X.ncol();
int nrow = X.nrow();
Rcpp::NumericVector out(ncol);
for (int col = 0; col < ncol; col++){
out[col]=Rcpp::sum(X(_, col));
}
return wrap(out/nrow);
}
/*** R
set.seed(42)
X <- matrix(rnorm(100*10), 100, 10)
library(microbenchmark)
mb <- microbenchmark(colMeans(X), colMeansRcpp(X), colMeansRcppArmadillo(X),
colMedianRcpp(X), times=50)
print(mb)
*/
And here is the result on my machine, with the concise Armadillo version about as fast as yours, and median a little slower as it has to do more work:
R> sourceCpp("/tmp/stephen.cpp")
R> set.seed(42)
R> X <- matrix(rnorm(100*10), 100, 10)
R> library(microbenchmark)
R> mb <- microbenchmark(colMeans(X), colMeansRcpp(X), colMeansRcppArmadillo(X),
+ colMedianRcpp(X), times=50)
R> print(mb)
Unit: microseconds
expr min lq median uq max neval
colMeans(X) 9.469 10.422 11.5810 12.421 30.597 50
colMeansRcpp(X) 3.922 4.281 4.5245 5.306 18.020 50
colMeansRcppArmadillo(X) 4.196 4.549 4.9295 5.927 11.159 50
colMedianRcpp(X) 15.615 16.291 16.7290 17.971 27.026 50
R>
Upvotes: 2
Reputation: 32381
You can copy the column into a new vector with
NumericVector y = x(_,j);
Complete example:
library(Rcpp)
cppFunction('
NumericVector colMedianRcpp(NumericMatrix x) {
int nrow = x.nrow();
int ncol = x.ncol();
int position = nrow / 2; // Euclidian division
NumericVector out(ncol);
for (int j = 0; j < ncol; j++) {
NumericVector y = x(_,j); // Copy the column -- the original will not be modified
std::nth_element(y.begin(), y.begin() + position, y.end());
out[j] = y[position];
}
return out;
}
')
x <- matrix( sample(1:12), 3, 4 )
x
colMedianRcpp(x)
x # Unchanged
Upvotes: 5