dataanalyst

Reputation: 316

R Improve performance of function(s)

This question is related to my previous one. Here is a small sample of the data. I have used both data.table and data.frame to try to find a faster solution.

test.dt <- data.table(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6), 
                   a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
                   a5.6=c(4,8,2,1,3,9))

test.dt[,rown:=as.numeric(row.names(test.dt))]

test.df <- data.frame(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6), 
                   a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
                   a5.6=c(4,8,2,1,3,9))

test.df$rown <- as.numeric(row.names(test.df))

> test.df
  strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1    1   2    1    2    3    5    4    1
2    1   1    2    4    1    1    8    2
3    2   5    3    6    2   15    2    3
4    3   5    4    8    4   10    1    4
5    5   5    5   10    5   12    3    5
6    2   4    6   12    1   10    9    6

I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need the mean of a1.2 and a2.3; in the third row, I need the mean of a2.3, a3.4, a4.5, and a5.6.
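To make the row-1 calculation concrete (illustrative, using the test.df defined above):

mean(c(test.df[1, "a1.2"], test.df[1, "a2.3"]))
[1] 1.5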

The output should be a vector like this:

> k
       1        2        3        4        5        6 
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667 

Here is what I tried:

Solution 1: This uses the data.table and applies a function over it.

func.dt <- function(rown, x, y) {
  tmp  <- paste0("a", x, ".", x + 1)
  tmp1 <- paste0("a", y, ".", y + 1)
  rowMeans(test.dt[rown, get(tmp):get(tmp1), with=FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by=.(rown)]

Solution 2: This uses the data.frame and applies a function over it.

func.df <- function(rown, x, y) {
  rowMeans(test.df[rown, (x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)

Solution 3: This uses the data.frame and loops through it.

test.ave <- rep(NA, length(test.df$strt))
for (i in 1:length(test.df$strt)) {
  test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i, 1] + 2):as.numeric(test.df[i, 2] + 2), drop=FALSE])
}

Benchmarking shows that Solution 2 is the fastest.

test replications elapsed relative user.self sys.self user.child sys.child
1 sol1          100    0.67    4.786      0.67        0         NA        NA
2 sol2          100    0.14    1.000      0.14        0         NA        NA
3 sol3          100    0.15    1.071      0.16        0         NA        NA
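A table in this format is what rbenchmark::benchmark() prints; here is a sketch of a call that could produce it (the sol1/sol2/sol3 labels are my own):

library(rbenchmark)
benchmark(
  sol1 = test.dt[, func.dt(rown, strt, end), by = .(rown)],
  sol2 = mapply(func.df, test.df$rown, test.df$strt, test.df$end),
  sol3 = {
    test.ave <- rep(NA, nrow(test.df))
    for (i in 1:nrow(test.df)) {
      test.ave[i] <- rowMeans(test.df[i, (test.df[i, 1] + 2):(test.df[i, 2] + 2), drop = FALSE])
    }
  },
  replications = 100
)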

But this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table, and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help improve my function(s) or point to alternate solutions.

Upvotes: 2

Views: 377

Answers (3)

Jacob H

Reputation: 4513

Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.

library(Rcpp)

cppFunction('NumericVector MYrcpp(NumericMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  NumericVector out(nrow);

  for (int i = 0; i < nrow; i++) {
    double avg = 0;

    // strt and end live in the first two columns (0-indexed)
    int start = x(i, 0);
    int end = x(i, 1);
    int N = end - start + 1;

    // a-column k sits at 0-based column k + 1, so sum columns strt..end
    while (start <= end) {
      avg += x(i, start + 1);
      start = start + 1;
    }

    out[i] = avg / N;
  }
  return out;
}')

For this code I'm going to pass the data.frame in as a matrix, i.e. testM <- as.matrix(test.df).

Let's see if it works...

 MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667

How fast is it?

Unit: microseconds
          expr      min        lq      mean   median       uq       max neval
          f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951  4735.851   100
          f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882   100
          f4()  281.541  315.2680  364.2197  345.328  375.877  1089.994   100
 MYrcpp(testM)    3.422   10.0205   16.7708   19.552   21.507    56.700   100

where f2(), f3(), and f4() are defined as:

f2 <- function(){
  func.df <- function(rown, x, y) {
    rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
  }
  k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}

f3 <- function(){
  test.ave <- rep(NA, length(test.df$strt))
  for (i in 1 : length(test.df$strt)) {
    test.ave[i] <- rowMeans(test.df[i,as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
  }
}

f4 <- function(){
  lapply(
    apply(test.df,1, function(x){
      x[(x[1]+2):(x[2]+2)]}),
    mean)
}

That's roughly a 20x speedup over the fastest of the R solutions.

Note: to implement the above code you'll need a C++ compiler which R can access. For Windows, look into Rtools. For more on Rcpp, read this.

Now let's see how it scales.

N = 5e3
test.df <- data.frame(strt = 1, 
                 end = sample(5, N, replace = TRUE), 
                 a1.2 = sample(3, N, replace = TRUE), 
                 a2.3 = sample(7, N, replace = TRUE), 
                 a3.4 = sample(14, N, replace = TRUE),
                 a4.5 = sample(8, N, replace = TRUE),
                 a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))


test.dt <- as.data.table(test.df)
testM <- as.matrix(test.df)  # regenerate the matrix so MYrcpp sees the new data

microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
          expr       min         lq        mean     median          uq       max neval
          f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49   100
 MYrcpp(testM)   196.003    216.533    242.6732    235.107    261.0125    499.54   100

With 5e3 rows MYrcpp is now roughly 550x faster. This is partially due to the fact that f4() does not scale well, as Richard discusses in the comments: it essentially invokes a nested for loop by calling an apply within an lapply. Interestingly, the C++ code also uses a nested loop, a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already compiled and does not need to be interpreted into something the machine can understand at run time.

I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.

Update: R equivalent of C++ code

MYr <- function(x){
  nrow <- nrow(x)
  ncol <- ncol(x)
  out <- matrix(NA, nrow = 1, ncol = nrow)

  for (i in 1:nrow) {
    avg <- 0

    # strt and end live in the first two columns (1-indexed in R)
    start <- x[i, 1]
    end <- x[i, 2]
    N <- end - start + 1

    # a-column k sits at column k + 2, so sum columns strt..end
    while (start <= end) {
      avg <- avg + x[i, start + 2]
      start <- start + 1
    }

    out[i] <- avg / N
  }

  out
}
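A quick sanity check that the R translation agrees with the C++ version (assuming testM is the original six-row matrix from the top of this answer):

as.vector(MYr(testM))
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667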

MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences:

  1. The first line of MYrcpp differs from MYr. In words, the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function named MYrcpp which returns an output of class NumericVector and takes an input x of class NumericMatrix.
  2. In C++ you have to declare the class of a variable when you introduce it, i.e. int nrow = x.nrow() creates a variable named nrow whose class is int (i.e. integer) and assigns it x.nrow(), the number of rows of x. (Ignore this if you're overwhelmed: nrow() is a method for instances of class NumericMatrix. As in Python, you call a method by attaching it to the instance. The R equivalents are S3 and S4 methods.)
  3. When you subset in C++ you use () instead of [] as in R. Also, indexing begins at zero (as in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R.
  4. ++ is an operator that means increment by 1, i.e. j++ is the same as j = j + 1. += is an operator that means add and assign, i.e. a += b is the same as a = a + b. (A side-by-side sketch follows this list.)
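To summarize points 3 and 4 side by side (illustrative R, assuming numeric x, j, a, and b exist):

x[1, 2]      # C++: x(0, 1)  -- zero-based indexing with round brackets
j <- j + 1   # C++: j++      -- increment by one
a <- a + b   # C++: a += b   -- add and assign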

Upvotes: 3

Joshua Ulrich

Reputation: 176648

I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.

f_jmu <- compiler::cmpfun({function(m) {
  # remove start/end columns from 'm' matrix
  ma <- m[,-(1:2)]
  # column index for each row in 'ma' matrix
  cm <- col(ma)
  # logical index of whether we need the column for each row
  i <- cm >= m[,1L] & cm <= m[,2L]
  # multiply the input matrix by the index matrix and sum it
  # divide by the sum of the index matrix to get the mean
  rowSums(i*ma) / rowSums(i)
}})
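As a quick sanity check, running it on the six-row test.df from the question (dropping the rown column before converting to a matrix) reproduces the expected means:

m <- as.matrix(test.df[, 1:7])  # strt, end, a1.2 ... a5.6
f_jmu(m)
       1        2        3        4        5        6 
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667 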

The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.

set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L, 
                 end = sample(5, N, replace = TRUE), 
                 a1.2 = sample(3, N, replace = TRUE), 
                 a2.3 = sample(7, N, replace = TRUE), 
                 a3.4 = sample(14, N, replace = TRUE),
                 a4.5 = sample(8, N, replace = TRUE),
                 a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)

Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
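One quick way to confirm the storage mode:

R> storage.mode(test.m)
[1] "integer"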

R> system.time(st1 <- MYrcpp(test.m))
   user  system elapsed 
  0.900   0.216   1.112 
R> system.time(st2 <- f_jmu(test.m))
   user  system elapsed 
  6.804   0.756   7.560 
R> identical(st1, st2)
[1] TRUE

Upvotes: 3

shayaa

Reputation: 2797

My solution is the first one in the benchmark below:

library(microbenchmark)
microbenchmark(
  lapply(
    apply(test.df,1, function(x){
        x[(x[1]+2):(x[2]+2)]}),
    mean),
  test.dt[, func.dt(rown, strt, end), by=.(rown)]
)

      min        lq      mean   median       uq      max neval
  138.654  175.7355  254.6245  201.074  244.810 3702.443   100
 4243.641 4747.5195 5576.3399 5252.567 6247.201 8520.286   100

It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.

Upvotes: 2
