CRP

Reputation: 435

Speed up r loop string match (vector vs. data.frame)

I'm trying to optimize a loop in R that counts, for each row of a data frame, how many cells match each element of a vector. On small datasets it works well enough (~15 min; 11 columns, 914 rows). However, it takes days to run on large datasets (914 columns, 18,000 rows). Here's my extremely basic loop:

for (j in 1:dim(pddbnh)[1]) {
  for (i in 1:dim(pidf)[1]) {
    richa[i, j] <- length(pidf[i, ][pidf[i, ] == row.names(pddbnh)[j]])
  }
}

I'm wondering if anyone knows how to optimize this loop using another approach (e.g. vectorization). Any solution would be much appreciated!

UPDATE: Here's a small dataset (the fastest one to run):

df <- data.frame(replicate(10, sample(c("sp1", "sp2"), 10, replace = TRUE)))
vec <- c("sp1", "sp2")
richa <- data.frame()

for (j in 1:length(vec)) {
  for (i in 1:dim(df)[1]) {
    richa[i, j] <- length(df[i, ][df[i, ] == vec[j]])
  }
}

Upvotes: 0

Views: 960

Answers (1)

rosscova

Reputation: 5590

Here's a method using lapply (see further down for an even faster one):

richa <- lapply( X = vec, FUN = function(x) rowSums( df == x ) )
richa <- do.call( cbind, richa )

A quick microbenchmark on the small dataset you've provided shows this at about 10x faster than your for loop method.
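As a sanity check, here is a minimal sketch comparing the vectorized result against a loop on the small sample data. The seed and the names loop_res and vec_res are only for illustration, and the loop stores its counts in a preallocated matrix rather than the empty data frame used in the question.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
df  <- data.frame(replicate(10, sample(c("sp1", "sp2"), 10, replace = TRUE)))
vec <- c("sp1", "sp2")

# Loop version: one count per (row, vec element) pair
loop_res <- matrix(0L, nrow = nrow(df), ncol = length(vec))
for (j in seq_along(vec)) {
  for (i in seq_len(nrow(df))) {
    loop_res[i, j] <- sum(df[i, ] == vec[j])
  }
}

# Vectorized version: df == x gives a logical matrix, rowSums counts the TRUEs per row
vec_res <- do.call(cbind, lapply(vec, function(x) rowSums(df == x)))

# Both approaches should agree cell by cell
stopifnot(all(vec_res == loop_res))
```

The speed-up comes from doing one whole-table comparison per element of vec instead of one comparison per row per element.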

Just to add, this could easily be parallelized as well for really big datasets, using either parallel::mclapply or plyr::laply (with .parallel = TRUE and a registered parallel backend). It takes a little extra work, but might be worthwhile for those 18000 x 914 datasets you've got.
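A rough sketch of the mclapply variant, assuming the same small df and vec as in the question. The core count of 2 is an arbitrary choice, and since mclapply relies on forking, the sketch falls back to a single core on Windows. Note that the work is split across the elements of vec, so this only helps when vec is reasonably long.

```r
library(parallel)

df  <- data.frame(replicate(10, sample(c("sp1", "sp2"), 10, replace = TRUE)))
vec <- c("sp1", "sp2")

# mclapply forks worker processes, which is not supported on Windows,
# so use one core there (it then behaves like plain lapply)
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

richa_par <- mclapply(vec, function(x) rowSums(df == x), mc.cores = n_cores)
richa_par <- do.call(cbind, richa_par)
```

The result has the same shape as the lapply version: one row per row of df and one column per element of vec.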

EDIT TO ADD: since you've got a few for loops going there (and since I'm learning Rcpp, and keen to practice) here's an even faster solution using Rcpp. Here's the function definition (which needs to be compiled once):

Rcpp::cppFunction(' IntegerMatrix charCrossCheck( CharacterMatrix df,
                          CharacterVector vec ) {

              IntegerMatrix output( df.nrow(), vec.size() );

              for (int j=0; j < vec.size(); ++j ){
                  for (int i=0; i < df.nrow(); ++i ){
                      int count = 0;
                      for( int k=0; k < df.ncol(); k++ ){
                          if( df(i,k) == vec[j] ) {
                              count++;
                          }
                      }
                      output(i,j) = count;
                  }
              }
              return output;

              } ')

Then you can call that function with:

richa <- charCrossCheck( as.matrix(df), vec )

Rcpp is very fast here. Microbenchmark on your very small sample shows it more than 3x faster than my lapply solution above, and about 38x faster than your for loops in R.

Interestingly, expanding the input data out to a df of size 4000x4000 and a vec of length 10, both the Rcpp and lapply methods complete the job in very similar times (3.4 and 3.9 seconds respectively). On a dataset of the size you mention (18000 rows x 914 columns, with vec of length 2), both solutions finish in well under 1 second. Not bad either way!
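If you want to try timings like these yourself, something along these lines should work. It is only a sketch: the sizes n_row and n_col are placeholders to adjust, it uses base system.time rather than microbenchmark, and it times only the lapply method since that needs no compiler.

```r
set.seed(42)   # arbitrary seed
n_row <- 2000  # placeholder sizes, increase to match your real data
n_col <- 200

big_df <- data.frame(replicate(n_col, sample(c("sp1", "sp2"), n_row, replace = TRUE)))
vec    <- c("sp1", "sp2")

# Time the vectorized approach on the larger table
timing <- system.time(
  big_richa <- do.call(cbind, lapply(vec, function(x) rowSums(big_df == x)))
)

print(timing["elapsed"])
dim(big_richa)  # n_row rows, one column per element of vec
```

Timings will vary with the machine and R version, so treat the numbers above as indicative only.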

Upvotes: 3
