Armadillo C++: Sorting a vector in terms of two other vectors

Question

My question relates to a sorting exercise, which I can undertake easily (but perhaps slowly) in R and would like to undertake in C++ in order to speed up my code.

Consider three vectors of the same size a,b and c. In R, the following command would sort the vector first in terms of b and then, in case of ties, would further sort in terms of c.

a<-a[order(b,c),1]

Example:

a<-c(1,2,3,4,5)
b<-c(1,2,1,2,1)
c<-c(5,4,3,2,1)

> a[order(b,c)]
[1] 5 3 1 4 2

Is there an efficient way to undertake this in C++ using Armadillo vectors?

duckmayr · Accepted Answer

We can write the following C++ solution, which I have in a file SO_answer.cpp:

#include 
// [[Rcpp::depends(RcppArmadillo)]]

using namespace arma;

// [[Rcpp::export]]
vec arma_sort(vec x, vec y, vec z) {
    // Order the elements of x by sorting y and z;
    // we order by y unless there's a tie, then order by z.
    // First create a vector of indices
    uvec idx = regspace(0, x.size() - 1);
    // Then sort that vector by the values of y and z
    std::sort(idx.begin(), idx.end(), [&](int i, int j){
        if ( y[i] == y[j] ) {
            return z[i] < z[j];
        }
        return y[i] < y[j];
    });
    // And return x in that order
    return x(idx);
}

What we've done is take advantage of the fact that std::sort() allows you to sort based on a custom comparator. We use a comparator that compares the elements of z only if the elements of y are equal; otherwise it compares the values of y.¹ Then we can compile the file and test the function in R:

library(Rcpp)
sourceCpp("SO_answer.cpp")

set.seed(1234)
x <- sample(1:10)
y <- sample(1:10)
z <- sample(1:10)

y[sample(1:10, 1)] <- 1 # create a tie

all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good

Of course, we must also consider whether this actually gives you any performance increase, which is the whole reason why you're doing this. Let's benchmark:

library(microbenchmark)
microbenchmark(r = x[order(y, z)],
               arma = arma_sort(x, y, z),
               times = 1e4)

Unit: microseconds
 expr    min    lq      mean median    uq      max neval cld
    r 36.040 37.23 39.386160  37.64 38.32 3316.286 10000   b
 arma  5.055  6.07  7.155676   7.00  7.53  107.230 10000  a

On my machine, it looks like you get about a 5-6X increase in speed with small vectors, though this advantage doesn't hold as well when you scale up:

x <- sample(1:100)
y <- sample(1:100)
z <- sample(1:100)

y[sample(1:100, 10)] <- 1 # create some ties

all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good

microbenchmark(r = x[order(y, z)],
               arma = arma_sort(x, y, z),
               times = 1e4)

Unit: microseconds
 expr   min     lq     mean median     uq      max neval cld
    r 44.50 46.360 48.01275 46.930 47.755  294.051 10000   b
 arma 10.76 12.045 16.30033 13.015 13.715 5262.132 10000  a 

x <- sample(1:1000)
y <- sample(1:1000)
z <- sample(1:1000)

y[sample(1:100, 10)] <- 1 # create some ties

all.equal(x[order(y, z)], c(arma_sort(x, y, z))) # check against R
# [1] TRUE # Good

microbenchmark(r = x[order(y, z)],
               arma = arma_sort(x, y, z),
               times = 1e4)

Unit: microseconds
 expr     min       lq     mean   median       uq      max neval cld
    r 113.765 118.7950 125.7387 120.5075 122.4475 3373.696 10000   b
 arma  82.690  91.3925 104.0755  95.2350  99.4325 6040.162 10000  a

It's still faster, but by less than 2X once you're at vectors of length 1000. This is probably why F. Privé said this operation should be fast enough in R. While moving to C++ using Rcpp can give you great performance advantages, the extent to which you get gains is largely dependent on context, as mentioned many times by Dirk Eddelbuettel in answers to various questions here.

1 _{Note that typically for sorting Armadillo vectors I would suggest using sort() or sort_index() (see the Armadillo docs here). If you're trying to sort a vec by the values of a second vec, you could usex(arma::sort_index(y)) as I indicated in an answer to a related question here. You can even use stable_sort_index() to preserve ties. However, I couldn't figure out how to use these functions to solve the specific problem you present here.}

Armadillo C++: Sorting a vector in terms of two other vectors

Answers (1)

Related Questions