Erik
Erik

Reputation: 308

Performance of runif

I am working on a custom bootstrap algorithm for a specific problem, and as I want a large number of replicates I do care about performance. In this regard, I have some questions on how to use runif properly. I'm aware that I could run benchmarks myself, but C++ optimization tends to be difficult and I would also like to understand the reasons for any difference.

First question:

Is the first code block faster than the second?

for (int i = 0; i < n_boot; i++) {
  new_random = runif(n);  //new_random is pre-allocated in class
  // do something with the random numbers
}

for (int i = 0; i < n_boot; i++) {
  NumericVector new_random = runif(n);
  // do something with the random numbers
}

It probably comes down to whether runif fills the left side or if it allocates and passes a new NumericVector.

Second question:

If both versions allocate a new vector, can I improve things by generating one random number at a time in scalar mode?

In case you are wondering, memory allocation takes up a sizable part of my processing time. I have reduced runtime by 30% by optimizing other unnecessary memory allocations away, so it does matter.

Upvotes: 3

Views: 1607

Answers (2)

nrussell
nrussell

Reputation: 18612

I set up the following struct to try to represent your scenario accurately & facilitate the benchmarking:

#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]

struct runif_test {

  size_t runs;
  size_t each;

  runif_test(size_t runs, size_t each)
  : runs(runs), each(each)
  {}
  // Your first code block
  void pre_init() {
    Rcpp::NumericVector v = no_init();
    for (size_t i = 0; i < runs; i++) {
      v = Rcpp::runif(each);
    }
  }
  // Your second code block
  void post_init() {
    for (size_t i = 0; i < runs; i++) {
      Rcpp::NumericVector v = Rcpp::runif(each);
    }
  }
  // Generate 1 draw at a time  
  void gen_runif() {
    Rcpp::NumericVector v = no_init();
    for (size_t i = 0; i < runs; i++) {
      std::generate_n(v.begin(), each, []() -> double {
        return Rcpp::as<double>(Rcpp::runif(1));
      });
    }
  }
  // Reduce overhead of pre-allocated vector
  inline Rcpp::NumericVector no_init() {
    return Rcpp::NumericVector(Rcpp::no_init_vector(each));
  } 
};

where I benchmarked the following exported functions:

// [[Rcpp::export]]
void do_pre(size_t runs, size_t each) {
  runif_test obj(runs, each);
  obj.pre_init();
}

// [[Rcpp::export]]
void do_post(size_t runs, size_t each) {
  runif_test obj(runs, each);
  obj.post_init();
}

// [[Rcpp::export]]
void do_gen(size_t runs, size_t each) {
  runif_test obj(runs, each);
  obj.gen_runif();
}

Here are the results I got:

R>  microbenchmark::microbenchmark(
    do_pre(100, 10e4)
    ,do_post(100, 10e4)
    ,do_gen(100, 10e4)
    ,times=100L)
Unit: milliseconds
                 expr      min       lq      mean   median        uq       max neval
  do_pre(100, 100000) 109.9187 125.0477  145.9918 136.3749  152.9609  337.6143   100
 do_post(100, 100000) 103.1705 117.1109  132.9389 130.4482  142.7319  204.0951   100
  do_gen(100, 100000) 810.5234 911.3586 1005.9438 986.8348 1062.7715 1501.2933   100

R>  microbenchmark::microbenchmark(
    do_pre(100, 10e5)
    ,do_post(100, 10e5)
    ,times=100L)
Unit: seconds
                  expr      min       lq     mean   median       uq      max neval
  do_pre(100, 1000000) 1.355160 1.614972 1.740807 1.723704 1.815953 2.408465   100
 do_post(100, 1000000) 1.198667 1.342794 1.443391 1.429150 1.519976 2.042511   100

So, assuming I interpreted / accurately represented your second question,

If both versions allocate a new vector, can I improve things by generating one random number at a time in scalar mode?

with my gen_runif() member function, I think we can confidently say that this is not the optimal approach - ~ 7.5x slower than the other two functions.

More importantly, to address your first question, it seems that it is a little faster to just initialize & assign a new NumericVector to the output of Rcpp::runif(n). I'm certainly no C++ expert, but I believe the second method (assigning to a new local object) was faster than the first because of copy elision. In the second case, it looks as though two objects are being created - the object on the left of the =, v, and a (temporary? rvalue?) object on the right side of the =, which is the result of Rcpp::runif(). In reality though, the compiler will most likely optimize this unnecessary step out - which I think is explained in this passage from the article I linked:

When a nameless temporary, not bound to any references, would be moved or copied into an object of the same type ... the copy/move is omitted. When that temporary is constructed, it is constructed directly in the storage where it would otherwise be moved or copied to.

This was, at least, how I interpreted the results. Hopefully someone who is more well-versed in the language can confirm / deny / correct this conclusion.

Upvotes: 6

Kevin Ushey
Kevin Ushey

Reputation: 21315

Adding to to @nrussell's answer with some implementation details...

Use the source, Luke! definitely applies here, so let's look at the implementation of Rcpp::runif here:

inline NumericVector runif( int n, double min, double max ){
    if (!R_FINITE(min) || !R_FINITE(max) || max < min) return NumericVector( n, R_NaN ) ;
    if( min == max ) return NumericVector( n, min ) ;
    return NumericVector( n, stats::UnifGenerator( min, max ) ) ;
}

We see that an interesting constructor of NumericVector is being called with a stats::UnifGenerator object. That class's definition is here:

    class UnifGenerator__0__1 : public ::Rcpp::Generator<double> {
    public:

        UnifGenerator__0__1() {}

        inline double operator()() const {
            double u;
            do {u = unif_rand();} while (u <= 0 || u >= 1);
            return u;
        }
    } ;

So, that class just a functor -- it implements operator() and so objects of that class can be 'called'.

Finally, the associated NumericVector constructor is here:

template <typename U>
Vector( const int& size, const U& u) {
    RCPP_DEBUG_2( "Vector<%d>( const int& size, const U& u )", RTYPE, size )
    Storage::set__( Rf_allocVector( RTYPE, size) ) ;
    fill_or_generate( u ) ;
}

And that fill_or_generate function would eventually dispatch down here:

template <typename T>
inline void fill_or_generate__impl( const T& gen, traits::true_type) {
    iterator first = begin() ;
    iterator last = end() ;
    while( first != last ) *first++ = gen() ;
}

So we can see that a (templated) generator function is supplied to fill the vector, and the corresponding operator() of the gen object is used to populate the vector -- ie, in this case, the stats::UnifGenerator object.

So then, the question is, how does it all come together in this call?

NumericVector x = runif(10);

I always forget this for some reason, but I believe this is basically copy-construction of x from the result of the runif(10) call, but this point is also elaborated on by @nrussell. But, my understanding:

  1. runif generates a NumericVector of length 10 with runif elements -- call this temporary right-hand object tmp,
  2. x is copy-constructed to be the same as the aforementioned tmp.

I believe that compilers will be able to elide that copy construction, so that x is in fact directly constructed from the result of runif(10) and so should be efficient (at least, at any reasonable optimization level), but I could be wrong ....

Upvotes: 3

Related Questions