Remove NA values efficiently

Question

I need to remove NA values efficiently from vectors inside a function which is implemented with RcppEigen. I can of course do it using a for loop, but I wonder if there is a more efficient way.

Here is an example:

library(RcppEigen)
library(inline)

incl <- '
using  Eigen::Map;
using  Eigen::VectorXd;
typedef  Map  MapVecd;
'

body <- '
const MapVecd         x(as(xx)), y(as(yy));
VectorXd              x1(x), y1(y);
int                   k(0);
for (int i = 0; i < x.rows(); ++i) {
 if (x.coeff(i)==x.coeff(i) && y.coeff(i)==y.coeff(i)) {
  x1(k) = x.coeff(i);
  y1(k) = y.coeff(i);
  k++;
 };
};
x1.conservativeResize(k);
y1.conservativeResize(k);
return Rcpp::List::create(Rcpp::Named("x") = x1,
                          Rcpp::Named("y") = y1);
'

na.omit.cpp <- cxxfunction(signature(xx = "Vector", yy= "Vector"), 
                   body, "RcppEigen", incl)

na.omit.cpp(c(1.5, NaN, 7, NA), c(7.0, 1, NA, 3))
#$x
#[1] 1.5
#
#$y
#[1] 7

In my use case I need to do this about one million times in a loop (inside the Rcpp function) and the vectors could be quite long (let's assume 1000 elements).

PS: I've also investigated the route to find all NA/NaN values using x.array()==x.array(), but was unable to find a way to use the result for subsetting with Eigen.

mrip · Accepted Answer

Perhaps I am not understanding the question correctly, but within Rcpp, I don't see how you could possibly do this more efficiently than a for loop. for loops are generally inefficient in R only because iterating through a loop in R requires a lot of heavy interpreted machinery. But this is not the case once you are down at the C++ level. Even natively vectorized R functions ultimately are implemented with for loops in C. So the only way I can think to make this more efficient is to try to do it in parallel.

For example, here's a simple na.omit.cpp function that omits NA values from a single vector:

rcppfun<-"
Rcpp::NumericVector naomit(Rcpp::NumericVector x){
std::vector r(x.size());
int k=0;
  for (int i = 0; i < x.size(); ++i) {
    if (x[i]==x[i]) {
    r[k] = x[i];
    k++;
   }
  }
 r.resize(k);
 return Rcpp::wrap(r);    
}"

na.omit.cpp<-cppFunction(rcppfun)

This runs even more quickly than R's built in na.omit:

> set.seed(123)
> x<-1:10000
> x[sample(10000,1000)]<-NA
> y1<-na.omit(x)
> y2<-na.omit.cpp(x)
> all(y1==y2)
[1] TRUE
> require(microbenchmark)
> microbenchmark(na.omit(x),na.omit.cpp(x))
Unit: microseconds
           expr     min       lq   median      uq      max neval
     na.omit(x) 290.157 363.9935 376.4400 401.750 6547.447   100
 na.omit.cpp(x) 107.524 168.1955 173.6035 210.524  222.564   100

Remove NA values efficiently

Answers (2)

Related Questions