Reputation: 1954
I have a self-defined function whose results I want in a data.table. I need to apply this function to some variables from each row of another data.table. I have a method that works the way I want it to, but it is quite slow, and I am looking for an approach that will speed it up.
In my sample below, the important results are Column, which is generated in a while loop and varies in length given the input data, and Column2.
My approach has been to have the function append the results to an existing data.table using update by reference, :=. To achieve this properly, I set the length of Column and Column2 to a known maximum, replace NAs with 0, and simply add to an existing data.table addTable like so: addTable[, First := First + Column]
This method works with how I have applied the function over each row of the source data.table, using mapply. This way, I needn't worry about the actual product of the mapply call (some kind of matrix); it just updates addTable for each row it applies sample_fun to.
Here is a reproducible sample:
dt <- data.table(X = c(1:100), Y = c(.5, .7, .3, .4), Z = c(1:50000))
addTable <- data.table(First = 0, Second = 0, Term = c(1:50))
sample_fun <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- rep(z, length(Column))
  addTable[, First := First + Column]
  addTable[, Second := Second + Column2]
}
If I run this with dt at 50k rows, it takes ~30 seconds:
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
Seems like a long time (longer with my real data/function). I originally thought this was due to the while loop, because of the ever-present warnings against explicit loops in R around these parts. However, upon testing sample_fun without the last two lines (where the data.table is updated), it clocked in at under 1 second over 50k rows.
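For reference, the test described above is just the same function with the two update lines dropped; a sketch (the name sample_fun_noupdate is made up for illustration):

sample_fun_noupdate <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- rep(z, length(Column))
  list(Column, Column2)  # no data.table update here
}
system.time(mapply(sample_fun_noupdate, dt$X, dt$Y, dt$Z))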
Long story short, why is this the slowest part if updating by reference is fast? And is there a better way to do this? Making sample_fun output a full data.table each time is considerably slower than what I have now.
Upvotes: 2
Views: 576
Reputation: 92292
Few notes here:

- data.table for your need could be overkill (though not necessarily) and you could probably avoid it.
- You are growing an object in a loop (Column <- c(Column, x)) - don't do that. In your case there is no need; just create a vector of zeroes of the right length and you can get rid of most of your function.
- You don't need to create Column2 - it is just z, as R will automatically recycle it to fit the correct size (see the short demo after this list).
- Don't compute nrow(addTable) per row either; that could just be an additional parameter.
- The := function itself has very little overhead here; the cost is in the data.table:::`[.data.table` dispatch. If you replace addTable[, First := First + Column] ; addTable[, Second := Second + Column2] with just addTable$First + Column ; addTable$Second + Column2, the run time will be reduced from ~35 secs to ~2 secs. Another way to illustrate this is to replace the two lines with set - e.g. set(addTable, j = "First", value = addTable[["First"]] + Column) ; set(addTable, j = "Second", value = addTable[["Second"]] + Column2) - which basically shares its source code with := and also runs in ~2 secs.
- Finally, you can use Reduce instead of updating the actual data set per row.

Let's see some examples.
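First, a quick demo of the recycling point (toy values, just to show the behavior):

z <- 7
numeric(5) + z          # z is recycled: same as numeric(5) + rep(z, 5)
# [1] 7 7 7 7 7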
Your original function timings
library(data.table)
dt <- data.table(X = c(1:100), Y = c(.5, .7, .3, .4), Z = c(1:50000))
addTable <- data.table(First = 0, Second = 0, Term = c(1:50))
sample_fun <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- rep(z, length(Column))
  addTable[, First := First + Column]
  addTable[, Second := Second + Column2]
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 30.71 0.00 30.78
30 secs is pretty slow...
1- Let's try removing the data.table:::`[.data.table` overhead
sample_fun <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- rep(z, length(Column))
  addTable$First + Column    # computed but not assigned - no update happens
  addTable$Second + Column2
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 2.25 0.00 2.26
^ That was much faster but didn't update the actual data set.
2- Now let's try replacing it with set, which has the same effect as := but without the data.table:::`[.data.table` overhead
sample_fun <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- rep(z, length(Column))
  set(addTable, j = "First", value = addTable[["First"]] + Column)
  set(addTable, j = "Second", value = addTable[["Second"]] + Column2)
}
system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 2.96 0.00 2.96
^ Well, that was also much faster than 30 secs and had the exact same effect as :=
3- Let's try it without using data.table at all
dt <- data.frame(X = c(1:100), Y = c(.5, .7, .3, .4), Z = c(1:50000))
addTable <- data.frame(First = 0, Second = 0, Term = c(1:50))
sample_fun <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- rep(z, length(Column))
  return(list(Column, Column2))
}
system.time(res <- mapply(sample_fun, dt$X, dt$Y, dt$Z))
# user system elapsed
# 1.34 0.02 1.36
^ That's even faster.
Now we can use Reduce combined with accumulate = TRUE in order to create those vectors:
# res is a 2 x 50000 list-matrix, so res[1, ] is the list of all Column vectors
system.time(addTable$First <- Reduce(`+`, res[1, ], accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.06
system.time(addTable$Second <- Reduce(`+`, res[2, ], accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.06
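Side note: since only the last element of the accumulated result is used, a plain Reduce (without accumulate = TRUE) should give the same totals; a minimal equivalent sketch:

addTable$First <- Reduce(`+`, res[1, ])
addTable$Second <- Reduce(`+`, res[2, ])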
Well, everything combined is now under 2 seconds (instead of 30 with your original function).
4- Further improvements could be to fix the other elements in your function (as pointed out above); in other words, your function could be just
sample_fun <- function(x, y, n) {
  Column <- numeric(n)  # pre-allocated; no growing inside the loop
  i <- 1L
  while (x >= 1) {
    x <- x * y
    Column[i] <- x
    i <- i + 1L
  }
  return(Column)
}
system.time(res <- Map(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user system elapsed
# 0.72 0.00 0.72
^ A twofold improvement in speed
Now we didn't even bother creating Column2, as we already have it in dt$Z. We also used Map instead of mapply, as it will be easier for Reduce to work with a list than with a matrix.
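To see the difference in the shapes they return, here is a toy call:

str(Map(`+`, 1:2, 3:4))     # a list of length 2
str(mapply(`+`, 1:2, 3:4))  # simplified to an atomic vector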
The next step is similar to before:
system.time(addTable$First <- Reduce(`+`, res, accumulate = TRUE)[[nrow(dt)]])
# user system elapsed
# 0.07 0.00 0.07
But we could improve this even further. Instead of using Map/Reduce we could create a matrix using mapply and then run matrixStats::rowCumsums over it (which is written in C++ internally) in order to calculate addTable$First:
system.time(res <- mapply(sample_fun, dt$X, dt$Y, nrow(addTable)))
# user system elapsed
# 0.76 0.00 0.76
system.time(addTable$First2 <- matrixStats::rowCumsums(res)[, nrow(dt)])
# user system elapsed
# 0 0 0
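The result is stored in First2 above, so, assuming you kept the Reduce result in First, you can sanity-check the two approaches against each other; this should return TRUE:

all.equal(addTable$First, addTable$First2)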
While the final step is simply summing dt$Z - every row contributes its z to each of the 50 entries of Second, so the whole column is just one total:
system.time(addTable$Second <- sum(dt$Z))
# user system elapsed
# 0 0 0
So eventually we went from ~30 secs to less than a second.
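Putting it all together, the whole fast path fits in a short script (a sketch assembling the steps above; assumes the matrixStats package is installed):

library(data.table)
library(matrixStats)

dt <- data.table(X = c(1:100), Y = c(.5, .7, .3, .4), Z = c(1:50000))
addTable <- data.table(First = 0, Second = 0, Term = c(1:50))

sample_fun <- function(x, y, n) {
  Column <- numeric(n)  # pre-allocated result vector
  i <- 1L
  while (x >= 1) {
    x <- x * y
    Column[i] <- x
    i <- i + 1L
  }
  Column
}

res <- mapply(sample_fun, dt$X, dt$Y, nrow(addTable))  # 50 x 50000 matrix
addTable$First <- rowCumsums(res)[, nrow(dt)]  # last cumulative column = row totals
addTable$Second <- sum(dt$Z)                   # each row adds its z to every entry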
Upvotes: 5