Reputation: 55
My question relates to a sorting exercise, which I can undertake easily (but perhaps slowly) in R and would like to undertake in C++ in order to speed up my code.
Consider three vectors of the same size a,b and c. In R, the following command would sort the vector first in terms of b and then, in case of ties, would further sort in terms of c.
a<-a[order(b,c),1]
Example:
a<-c(1,2,3,4,5)
b<-c(1,2,1,2,1)
c<-c(5,4,3,2,1)
> a[order(b,c)]
[1] 5 3 1 4 2
Is there an efficient way to undertake this in C++ using Armadillo vectors?
Upvotes: 2
Views: 1162
Reputation: 16930
We can write the following C++ solution, which I have in a file SO_answer.cpp
:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace arma;
// [[Rcpp::export]]
vec arma_sort(vec x, vec y, vec z) {
// Order the elements of x by sorting y and z;
// we order by y unless there's a tie, then order by z.
// First create a vector of indices
uvec idx = regspace<uvec>(0, x.size() - 1);
// Then sort that vector by the values of y and z
std::sort(idx.begin(), idx.end(), [&](int i, int j){
if ( y[i] == y[j] ) {
return z[i] < z[j];
}
return y[i] < y[j];
});
// And return x in that order
return x(idx);
}
What we've done is take advantage of the fact that std::sort()
allows you to sort based on a custom comparator. We use a comparator that compares the elements of z
only if the elements of y
are equal; otherwise it compares the values of y
.1 Then we can compile the file and test the function in R:
library(Rcpp)
sourceCpp("SO_answer.cpp")
set.seed(1234)
x <- sample(1:10)
y <- sample(1:10)
z <- sample(1:10)
y[sample(1:10, 1)] <- 1 # create a tie
all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good
Of course, we must also consider whether this actually gives you any performance increase, which is the whole reason why you're doing this. Let's benchmark:
library(microbenchmark)
microbenchmark(r = x[order(y, z)],
arma = arma_sort(x, y, z),
times = 1e4)
Unit: microseconds
expr min lq mean median uq max neval cld
r 36.040 37.23 39.386160 37.64 38.32 3316.286 10000 b
arma 5.055 6.07 7.155676 7.00 7.53 107.230 10000 a
On my machine, it looks like you get about a 5-6X increase in speed with small vectors, though this advantage doesn't hold as well when you scale up:
x <- sample(1:100)
y <- sample(1:100)
z <- sample(1:100)
y[sample(1:100, 10)] <- 1 # create some ties
all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good
microbenchmark(r = x[order(y, z)],
arma = arma_sort(x, y, z),
times = 1e4)
Unit: microseconds
expr min lq mean median uq max neval cld
r 44.50 46.360 48.01275 46.930 47.755 294.051 10000 b
arma 10.76 12.045 16.30033 13.015 13.715 5262.132 10000 a
x <- sample(1:1000)
y <- sample(1:1000)
z <- sample(1:1000)
y[sample(1:100, 10)] <- 1 # create some ties
all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good
microbenchmark(r = x[order(y, z)],
arma = arma_sort(x, y, z),
times = 1e4)
Unit: microseconds
expr min lq mean median uq max neval cld
r 113.765 118.7950 125.7387 120.5075 122.4475 3373.696 10000 b
arma 82.690 91.3925 104.0755 95.2350 99.4325 6040.162 10000 a
It's still faster, but by less than 2X once you're at vectors of length 1000. This is probably why F. Privé said this operation should be fast enough in R. While moving to C++ using Rcpp can give you great performance advantages, the extent to which you get gains is largely dependent on context, as mentioned many times by Dirk Eddelbuettel in answers to various questions here.
sort()
or sort_index()
(see the Armadillo docs here). If you're trying to sort a vec
by the values of a second vec
, you could usex(arma::sort_index(y))
as I indicated in an answer to a related question here. You can even use stable_sort_index()
to preserve ties. However, I couldn't figure out how to use these functions to solve the specific problem you present here.
Upvotes: 6