Reputation: 364
I'm struggling to vectorize the repeated application of the following function, which I currently have implemented as a for loop. This small example is representative of a problem with a much larger dataset, where vectorizing would give a worthwhile runtime improvement:
action = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[which(firsttask>secondtask)])
  fourthtask = length(firsttask)
  output = list(firsttask, data.frame(average=secondtask,
                                      min_over_mean=thirdtask,
                                      size=fourthtask))
  return(output)
}
thingofinterest = c(1:10)
limits = c(5:10)
test = vector("list", length = length(limits))
for(i in 1:length(limits)) {
  test[[i]] = action(thingofinterest, limits, i)
}
test
I want to replace the for loop with a vectorized command, not any of the apply family of functions, as they do not always improve performance (I'm not suggesting there is anything wrong with for loops; I just need to optimize for speed in this case. See: Is R's apply family more than syntactic sugar?). How would I do this?
Upvotes: 4
Views: 2540
Reputation: 15784
Just to add a comparison point with the *apply family, I used the code below (results verified with identical(f1(), f2()); f3 returns a different layout).
After testing, the which call gives some speed increase on a large thingofinterest vector (a short benchmark sketch for this point follows the comparison code).
thingofinterest = c(1:100000)
limits = c(50:1000)

action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  thirdtask = mean(firsttask)
  secondtask = min(firsttask[which(firsttask>thirdtask)])
  fourthtask = length(firsttask)
  list(firsttask, min_over_mean=secondtask,
       average=thirdtask,
       size=fourthtask)
}

f1 <- function() {
  test = vector("list", length = length(limits))
  for(i in 1:length(limits)) {
    test[[i]] = action1(thingofinterest, limits, i)
  }
  return(test)
}

action2 <- function(x,y) {
  firsttask = cumsum(x[which(x<y)])
  thirdtask = mean(firsttask)
  secondtask = min(firsttask[which(firsttask>thirdtask)])
  fourthtask = length(firsttask)
  list(firsttask, min_over_mean=secondtask,
       average=thirdtask,
       size=fourthtask)
}

f2 <- function() {
  test <- lapply(limits, action2, x=thingofinterest)
  return(test)
}

f3 <- function() {
  test <- sapply(limits, action2, x=thingofinterest)
  return(test)
}
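As a quick aside on the which point mentioned above, here is a minimal sketch of how that subsetting difference can be checked in isolation (my own addition, with hypothetical sizes):

library(microbenchmark)
x <- runif(1e6)   # hypothetical large vector
y <- 0.5
# compare plain logical subsetting against which()-based subsetting
microbenchmark(x[x < y], x[which(x < y)], times = 100)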
For a 1M-element thingofinterest and 950 limits, here are the results on my machine:
> microbenchmark(f1(),f2(),f3(),times=10)
Unit: seconds
expr min lq mean median uq max neval
f1() 4.303302 4.336767 4.373119 4.380383 4.403434 4.441945 10
f2() 4.267922 4.327208 4.450175 4.399422 4.423191 5.041011 10
f3() 4.240551 4.293855 4.412548 4.362949 4.468117 4.730717 10
So a cleanly done for loop is not that bad in this case.
I have a feeling there's probably a data.table way to do the "action" work in one pass, but it's outside my knowledge area for now.
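For what it's worth, here is a rough sketch of what that one-pass data.table idea might look like (my own addition, not benchmarked; the names are illustrative, and only the summary values are collected, not the per-limit cumsum vectors):

library(data.table)
vals    <- data.table(val = thingofinterest)
lims_dt <- data.table(cutoff = limits)
# non-equi join: for every cutoff, keep the values strictly below it
# (assumes the original order of thingofinterest is preserved within each group)
joined <- vals[lims_dt, .(cutoff = i.cutoff, val = x.val),
               on = .(val < cutoff), allow.cartesian = TRUE]
# one grouped pass to compute the summary columns per cutoff
summaries <- joined[, {
  cs <- cumsum(val)
  m  <- mean(cs)
  .(average = m, min_over_mean = min(cs[cs > m]), size = .N)
}, by = cutoff]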
On the speed topic more generally, I see no way to really vectorize this: since those vectors are not subsets of each other, their cumsums can't be "cut" to avoid recomputing the common sequences.
As you say in the comments, limits usually has between 90 and 110 entries, so parallel processing could be a sensible approach: each iteration is independent of the others, so they can be computed on different cores. (I'm thinking of mclapply, but there may be other tools better suited to your use case.)
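For example, a minimal sketch of the mclapply route, reusing action2 from above (my addition, not benchmarked; Unix-alikes only, Windows would need parLapply with a cluster):

library(parallel)
# run one limit per task across the available cores
test_par <- mclapply(limits, action2, x = thingofinterest,
                     mc.cores = max(1L, detectCores() - 1L))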
Upvotes: 2
Reputation: 176648
You need to understand where the bottlenecks are in your code before you start trying to change it to make it faster. For example:
timer <- function(action, thingofinterest, limits) {
  st <- system.time({       # for the wall time
    Rprof(interval=0.01)    # start R's profile timing
    for(j in 1:1000) {      # 1000 function calls
      test = vector("list")
      for(i in 1:length(limits)) {
        test[[i]] = action(thingofinterest, limits, i)
      }
    }
    Rprof(NULL)             # stop the profiler
  })
  # return profiling results
  list(st, head(summaryRprof()$by.total))
}
action = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = min(firsttask[which(firsttask>mean(firsttask))])
  thirdtask = mean(firsttask)
  fourthtask = length(firsttask)
  output = list(firsttask, data.frame(min_over_mean=secondtask,
                                      average=thirdtask,
                                      size=fourthtask))
  return(output)
}
timer(action, 1:1000, 50:100)
# [[1]]
# user system elapsed
# 9.720 0.012 9.737
#
# [[2]]
# total.time total.pct self.time self.pct
# "system.time" 9.72 100.00 0.07 0.72
# "timer" 9.72 100.00 0.00 0.00
# "action" 9.65 99.28 0.24 2.47
# "data.frame" 8.53 87.76 0.84 8.64
# "as.data.frame" 5.50 56.58 0.44 4.53
# "force" 4.40 45.27 0.11 1.13
You can see that very little time is spent outside the call to your action function. Now, for is a special primitive and is therefore not captured by the profiler, but the total time reported by the profiler is very similar to the wall time, so there can't be a lot of time missing from the profiler's total.
And the thing that takes the most time in your action function is the call to data.frame. Remove that, and you get an enormous speedup.
action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[which(firsttask>mean(firsttask))])
  fourthtask = length(firsttask)
  list(task=firsttask, average=secondtask,
       min_over_mean=thirdtask, size=fourthtask)
}
timer(action1, 1:1000, 50:100)
# [[1]]
# user system elapsed
# 1.020 0.000 1.021
#
# [[2]]
# total.time total.pct self.time self.pct
# "system.time" 1.01 100.00 0.06 5.94
# "timer" 1.01 100.00 0.00 0.00
# "action" 0.95 94.06 0.17 16.83
# "which" 0.57 56.44 0.23 22.77
# "mean" 0.25 24.75 0.13 12.87
# "<" 0.20 19.80 0.20 19.80
Now you can also get rid of one of the calls to mean and both calls to which.
action2 = function(x,y,i) {
  firsttask = cumsum(x[x < y[i]])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[firsttask > secondtask])
  fourthtask = length(firsttask)
  list(task=firsttask, average=secondtask,
       min_over_mean=thirdtask, size=fourthtask)
}
timer(action2, 1:1000, 50:100)
# [[1]]
# user system elapsed
# 0.808 0.000 0.808
#
# [[2]]
# total.time total.pct self.time self.pct
# "system.time" 0.80 100.00 0.12 15.00
# "timer" 0.80 100.00 0.00 0.00
# "action" 0.68 85.00 0.24 30.00
# "<" 0.20 25.00 0.20 25.00
# "mean" 0.13 16.25 0.08 10.00
# ">" 0.05 6.25 0.05 6.25
Now you can see there's a "significant" amount of time spent doing stuff outside your action function. I put significant in quotes because it's 15% of the runtime, but only about 120 milliseconds. Overall that's roughly a 12x speedup (about 9.7 seconds versus 0.8 seconds for the 1000-call benchmark), so if your actual code took ~12 hours to run, this new action function would finish in ~1 hour.
The results would be marginally better if I pre-allocated the test list outside of the for loop in the timer function, but the call to data.frame is the biggest time-consumer.
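For reference, that pre-allocation change would amount to replacing the test = vector("list") line inside timer() with a full-length allocation, something like:

# pre-allocate the list at its final length instead of growing it
test = vector("list", length = length(limits))
for(i in 1:length(limits)) {
  test[[i]] = action(thingofinterest, limits, i)
}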
Upvotes: 5
Reputation: 70266
Here's a little comparison in reference to my comment above. I've made the changes described in the comment (initialize test at its full length, change the order of operations in action, and remove the data.frame call from the list output of action, if you can accept that):
library(microbenchmark)
microbenchmark(f0(), f1())
Unit: microseconds
expr min lq mean median uq max neval
f0() 14042.192 14730.036 16091.508 15168.3175 16993.631 28193.770 100
f1() 894.555 928.738 1094.448 985.2865 1190.252 4710.675 100
These modifications brought a roughly 15-fold speedup.
Functions and data for comparison:
action0 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = min(firsttask[which(firsttask>mean(firsttask))])
  thirdtask = mean(firsttask)
  fourthtask = length(firsttask)
  output = list(firsttask, data.frame(min_over_mean=secondtask,
                                      average=thirdtask,
                                      size=fourthtask))
  return(output)
}

f0 <- function() {
  test = vector("list")
  for(i in 1:length(limits)) {
    test[[i]] = action0(thingofinterest, limits, i)
  }
}
thingofinterest = c(1:1000)
limits = c(50:100)

action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  thirdtask = mean(firsttask)
  secondtask = min(firsttask[which(firsttask>thirdtask)])
  fourthtask = length(firsttask)
  list(firsttask, min_over_mean=secondtask,
       average=thirdtask,
       size=fourthtask)
}

f1 <- function() {
  test = vector("list", length = length(limits))
  for(i in 1:length(limits)) {
    test[[i]] = action1(thingofinterest, limits, i)
  }
}
Upvotes: 3