moman822

Reputation: 1954

Update data.table with mapply speed issue

I have a self-defined function, the results of which I want in a data.table. I need to apply this function to some variables from each row of another data.table. I have a method that works how I want it to, but it is quite slow and I am looking to see if there is an approach which will speed it up.

In my sample below, the important results are Column, which is generated in a while loop and varies in length given the input data, and Column2.

My approach has been to have the function append the results to an existing data.table using the update by reference, :=. To achieve this properly, I set the length of Column and Column2 to a known maximum, replace NAs with 0, and simply add to an existing data.table addTable like so: addTable[, First:=First + Column]

This method works with how I have applied the function over each row of the source data.table, using mapply. This way, I needn't worry about the actual product of the mapply call (some kind of matrix); it just updates addTable for each row it applies sample_fun to.

Here is a reproducible sample:

library(data.table)
dt <- data.table(X = c(1:100), Y = c(.5, .7, .3, .4), Z = c(1:50000))
addTable <- data.table(First=0, Second=0, Term=c(1:50))

sample_fun <- function(x, y, z) {
  Column <- NULL
  while(x>=1) {
    x <- x*y
    Column <- c(Column, x)
  }

  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0

  Column2 <- NULL
  Column2 <- rep(z, length(Column))

  addTable[, First := First + Column]
  addTable[, Second := Second + Column2]
}

If I run this with dt at 50k rows, it takes ~ 30 seconds:

system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))

Seems like a long time (longer with my real data/function). I originally thought this was due to the while loop, given the ever-present warnings against explicit loops in R around these parts. However, upon testing sample_fun without the last two lines (where the data.table is updated), it clocked in at under 1 second over 50k rows.

Long story short, why is this the slowest part if updating by reference is fast? And is there a better way to do this? Making sample_fun output a full data.table each time is considerably slower than what I have now.

Upvotes: 2

Views: 576

Answers (1)

David Arenburg

Reputation: 92292

A few notes here:

  1. As it stands now, using data.table for your need could be overkill (though not necessarily) and you could probably avoid it.
  2. You are growing an object in a loop (Column <- c(Column, x)) - don't do that. In your case there is no need. Just create an empty vector of zeroes and you can get rid of most of your function.
  3. There is absolutely no need to create Column2 - it is just z, as R will automatically recycle it to the correct length.
  4. There is no need to recalculate nrow(addTable) per row either - that could just be an additional parameter.
  5. Your biggest overhead is calling data.table:::`[.data.table` per row - it is a very expensive function. The := function itself has very little overhead here. If you replace addTable[, First := First + Column] ; addTable[, Second := Second + Column2] with just addTable$First + Column ; addTable$Second + Column2, the run time is reduced from ~35 secs to ~2 secs. Another way to illustrate this is by replacing the two lines with set - e.g. set(addTable, j = "First", value = addTable[["First"]] + Column) ; set(addTable, j = "Second", value = addTable[["Second"]] + Column2) - which basically shares its source code with :=. This also runs in ~2 secs.
  6. Finally, it is better to reduce the number of operations per row. You could try accumulating the result using Reduce instead of updating the actual data set per row.
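Point 3, R's recycling rule, can be seen in a tiny sketch (the scalar 7 stands in for z here):

```r
# A shorter vector is recycled to match the longer one,
# so rep(z, length(Column)) is unnecessary:
numeric(5) + 7
# [1] 7 7 7 7 7
```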

Let's see some examples

Your original function timings

library(data.table)
dt <- data.table(X= c(1:100), Y=c(.5, .7, .3, .4), Z=c(1:50000))    
addTable <- data.table(First=0, Second=0, Term=c(1:50))

sample_fun <- function(x, y, z) {
  Column <- NULL
  while(x>=1) {
    x <- x*y
    Column <- c(Column, x)
  }

  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0

  Column2 <- NULL
  Column2 <- rep(z, length(Column))

  addTable[, First := First + Column]
  addTable[, Second := Second + Column2]
}

system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
#  user  system elapsed 
# 30.71    0.00   30.78 

30 secs is pretty slow...

1- Let's try removing the data.table:::`[.data.table` overhead

sample_fun <- function(x, y, z) {
  Column <- NULL
  while(x>=1) {
    x <- x*y
    Column <- c(Column, x)
  }

  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0

  Column2 <- NULL
  Column2 <- rep(z, length(Column))

  addTable$First + Column
  addTable$Second + Column2
}

system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user  system elapsed 
# 2.25    0.00    2.26 

^ That was much faster but didn't update the actual data set.

2- Now let's try replacing it with set, which will have the same effect as := but without the data.table:::`[.data.table` overhead

sample_fun <- function(x, y, z) {
  Column <- NULL
  while(x>=1) {
    x <- x*y
    Column <- c(Column, x)
  }

  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0

  Column2 <- NULL
  Column2 <- rep(z, length(Column))

  set(addTable, j = "First", value = addTable[["First"]] + Column)
  set(addTable, j = "Second", value = addTable[["Second"]] + Column2)
}

system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user  system elapsed 
# 2.96    0.00    2.96 

^ Well, that was also much faster than 30 secs and had the exact same effect as :=

3- Let's try it without using data.table at all

dt <- data.frame(X= c(1:100), Y=c(.5, .7, .3, .4), Z=c(1:50000))    
addTable <- data.frame(First=0, Second=0, Term=c(1:50))

sample_fun <- function(x, y, z) {
  Column <- NULL
  while(x>=1) {
    x <- x*y
    Column <- c(Column, x)
  }

  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0

  Column2 <- NULL
  Column2 <- rep(z, length(Column))

  return(list(Column, Column2))
}

system.time(res <- mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user  system elapsed 
# 1.34    0.02    1.36 

^ That's even faster

Now we can use Reduce combined with accumulate = TRUE in order to create those vectors

system.time(addTable$First <- Reduce(`+`, res[1, ], accumulate = TRUE)[[nrow(dt)]])
# user  system elapsed 
# 0.07    0.00    0.06 
system.time(addTable$Second <- Reduce(`+`, res[2, ], accumulate = TRUE)[[nrow(dt)]])
# user  system elapsed 
# 0.07    0.00    0.06 
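To see what that Reduce call computes, here is a toy version: accumulate = TRUE returns all partial element-wise sums, and indexing the last element gives the total across the list.

```r
res <- list(c(1, 2), c(3, 4), c(5, 6))

# Partial sums: c(1, 2), c(4, 6), c(9, 12); the last one is the total
Reduce(`+`, res, accumulate = TRUE)[[length(res)]]
# [1]  9 12
```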

Well, everything combined is now under 2 seconds (instead of 30 with your original function).

4- Further improvements could be made by fixing the other elements in your function (as pointed out above); in other words, your function could be just

sample_fun <- function(x, y, n) {
  Column <- numeric(n)
  i <- 1L
  while(x >= 1) {
    x <- x * y
    Column[i] <- x
    i <- i + 1L
  }
  return(Column)
}

system.time(res <- Map(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user  system elapsed 
# 0.72    0.00    0.72 

^ A twofold improvement in speed

Now, we didn't even bother creating Column2, as we already have it in dt$Z. We also used Map instead of mapply, as it is easier for Reduce to work with a list than with a matrix.
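The structural difference between the two can be seen with a toy example:

```r
# mapply simplifies equal-length results into a matrix,
# while Map always returns a list (one element per call):
m <- mapply(function(a, b) a:b, 1:3, 4:6)
l <- Map(function(a, b) a:b, 1:3, 4:6)
is.matrix(m)   # TRUE  (a 4 x 3 matrix)
is.list(l)     # TRUE
```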

The next step is similar to as before

system.time(addTable$First <- Reduce(`+`, res, accumulate = TRUE)[[nrow(dt)]])
# user  system elapsed 
# 0.07    0.00    0.07 

But we could improve this even further. Instead of using Map/Reduce, we could create a matrix using mapply and then run matrixStats::rowCumsums over it (which is implemented in C++ internally) in order to calculate addTable$First.

system.time(res <- mapply(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user  system elapsed 
# 0.76    0.00    0.76 
system.time(addTable$First2 <- matrixStats::rowCumsums(res)[, nrow(dt)])
# user  system elapsed 
#    0       0       0 

While the final step is simply summing dt$Z

system.time(addTable$Second <- sum(dt$Z))
# user  system elapsed 
#    0       0       0

So eventually we went from ~30 secs to less than a second.


Some final notes

  1. As it seems the main remaining overhead is in the function itself, you could also try rewriting it using Rcpp, since the loop seems inevitable in this case (though the overhead doesn't appear to be that big).
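A minimal Rcpp sketch of that idea (assuming the Rcpp package and a working C++ toolchain are available; the name sample_fun_cpp is made up for illustration):

```r
library(Rcpp)  # assumes Rcpp is installed

cppFunction('
NumericVector sample_fun_cpp(double x, double y, int n) {
  NumericVector col(n);          // pre-allocated, zero-filled
  int i = 0;
  while (x >= 1 && i < n) {      // same termination rule as the R loop
    x *= y;
    col[i++] = x;
  }
  return col;
}')

sample_fun_cpp(100, 0.5, 50)   # first elements: 50, 25, 12.5, ...
```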

Upvotes: 5
