Reputation: 308
I am working on a custom bootstrap algorithm for a specific problem, and as I want a large number of replicates I do care about performance. In this regard, I have some questions on how to use runif properly. I'm aware that I could run benchmarks myself, but C++ optimization tends to be difficult and I would also like to understand the reasons for any difference.
First question:
Is the first code block faster than the second?
for (int i = 0; i < n_boot; i++) {
new_random = runif(n); //new_random is pre-allocated in class
// do something with the random numbers
}
for (int i = 0; i < n_boot; i++) {
NumericVector new_random = runif(n);
// do something with the random numbers
}
It probably comes down to whether runif fills the left side or if it allocates and passes a new NumericVector.
Second question:
If both versions allocate a new vector, can I improve things by generating one random number at a time in scalar mode?
In case you are wondering, memory allocation takes up a sizable part of my processing time. I have reduced runtime by 30% by optimizing other unnecessary memory allocations away, so it does matter.
Upvotes: 3
Views: 1607
Reputation: 18612
I set up the following struct
to try to represent your scenario accurately & facilitate the benchmarking:
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
struct runif_test {
size_t runs;
size_t each;
runif_test(size_t runs, size_t each)
: runs(runs), each(each)
{}
// Your first code block
void pre_init() {
Rcpp::NumericVector v = no_init();
for (size_t i = 0; i < runs; i++) {
v = Rcpp::runif(each);
}
}
// Your second code block
void post_init() {
for (size_t i = 0; i < runs; i++) {
Rcpp::NumericVector v = Rcpp::runif(each);
}
}
// Generate 1 draw at a time
void gen_runif() {
Rcpp::NumericVector v = no_init();
for (size_t i = 0; i < runs; i++) {
std::generate_n(v.begin(), each, []() -> double {
return Rcpp::as<double>(Rcpp::runif(1));
});
}
}
// Reduce overhead of pre-allocated vector
inline Rcpp::NumericVector no_init() {
return Rcpp::NumericVector(Rcpp::no_init_vector(each));
}
};
where I benchmarked the following exported functions:
// [[Rcpp::export]]
void do_pre(size_t runs, size_t each) {
runif_test obj(runs, each);
obj.pre_init();
}
// [[Rcpp::export]]
void do_post(size_t runs, size_t each) {
runif_test obj(runs, each);
obj.post_init();
}
// [[Rcpp::export]]
void do_gen(size_t runs, size_t each) {
runif_test obj(runs, each);
obj.gen_runif();
}
Here are the results I got:
R> microbenchmark::microbenchmark(
do_pre(100, 10e4)
,do_post(100, 10e4)
,do_gen(100, 10e4)
,times=100L)
Unit: milliseconds
expr min lq mean median uq max neval
do_pre(100, 100000) 109.9187 125.0477 145.9918 136.3749 152.9609 337.6143 100
do_post(100, 100000) 103.1705 117.1109 132.9389 130.4482 142.7319 204.0951 100
do_gen(100, 100000) 810.5234 911.3586 1005.9438 986.8348 1062.7715 1501.2933 100
R> microbenchmark::microbenchmark(
do_pre(100, 10e5)
,do_post(100, 10e5)
,times=100L)
Unit: seconds
expr min lq mean median uq max neval
do_pre(100, 1000000) 1.355160 1.614972 1.740807 1.723704 1.815953 2.408465 100
do_post(100, 1000000) 1.198667 1.342794 1.443391 1.429150 1.519976 2.042511 100
So, assuming I interpreted / accurately represented your second question,
If both versions allocate a new vector, can I improve things by generating one random number at a time in scalar mode?
with my gen_runif()
member function, I think we can confidently say that this is not the optimal approach - ~ 7.5x slower than the other two functions.
More importantly, to address your first question, it seems that it is a little faster to just initialize & assign a new NumericVector
to the output of Rcpp::runif(n)
. I'm certainly no C++ expert, but I believe the second method (assigning to a new local object) was faster than the first because of copy elision. In the second case, it looks as though two objects are being created - the object on the left of the =
, v
, and a (temporary? rvalue?) object on the right side of the =
, which is the result of Rcpp::runif()
. In reality though, the compiler will most likely optimize this unnecessary step out - which I think is explained in this passage from the article I linked:
When a nameless temporary, not bound to any references, would be moved or copied into an object of the same type ... the copy/move is omitted. When that temporary is constructed, it is constructed directly in the storage where it would otherwise be moved or copied to.
This was, at least, how I interpreted the results. Hopefully someone who is more well-versed in the language can confirm / deny / correct this conclusion.
Upvotes: 6
Reputation: 21315
Adding to to @nrussell's answer with some implementation details...
Use the source, Luke! definitely applies here, so let's look at the implementation of Rcpp::runif
here:
inline NumericVector runif( int n, double min, double max ){
if (!R_FINITE(min) || !R_FINITE(max) || max < min) return NumericVector( n, R_NaN ) ;
if( min == max ) return NumericVector( n, min ) ;
return NumericVector( n, stats::UnifGenerator( min, max ) ) ;
}
We see that an interesting constructor of NumericVector
is being called with a stats::UnifGenerator
object. That class's definition is here:
class UnifGenerator__0__1 : public ::Rcpp::Generator<double> {
public:
UnifGenerator__0__1() {}
inline double operator()() const {
double u;
do {u = unif_rand();} while (u <= 0 || u >= 1);
return u;
}
} ;
So, that class just a functor -- it implements operator()
and so objects of that class can be 'called'.
Finally, the associated NumericVector
constructor is here:
template <typename U>
Vector( const int& size, const U& u) {
RCPP_DEBUG_2( "Vector<%d>( const int& size, const U& u )", RTYPE, size )
Storage::set__( Rf_allocVector( RTYPE, size) ) ;
fill_or_generate( u ) ;
}
And that fill_or_generate
function would eventually dispatch down here:
template <typename T>
inline void fill_or_generate__impl( const T& gen, traits::true_type) {
iterator first = begin() ;
iterator last = end() ;
while( first != last ) *first++ = gen() ;
}
So we can see that a (templated) generator function is supplied to fill the vector, and the corresponding operator()
of the gen
object is used to populate the vector -- ie, in this case, the stats::UnifGenerator
object.
So then, the question is, how does it all come together in this call?
NumericVector x = runif(10);
I always forget this for some reason, but I believe this is basically copy-construction of x
from the result of the runif(10)
call, but this point is also elaborated on by @nrussell. But, my understanding:
runif
generates a NumericVector
of length 10 with runif
elements -- call this temporary right-hand object tmp
,x
is copy-constructed to be the same as the aforementioned tmp
.I believe that compilers will be able to elide that copy construction, so that x
is in fact directly constructed from the result of runif(10)
and so should be efficient (at least, at any reasonable optimization level), but I could be wrong ....
Upvotes: 3