Nigel Stackhouse

Reputation: 364

Vectorize for loop in R

I'm struggling to vectorize the repeated application of the following function, which I have currently implemented as a for loop. This small example is indicative of a problem with a larger dataset, where vectorizing should yield a worthwhile runtime improvement:

action = function(x,y,i) {

    firsttask = cumsum(x[which(x<y[i])])                     # cumulative sum of the x values below the i-th limit
    secondtask = mean(firsttask)                              # mean of that cumulative sum
    thirdtask = min(firsttask[which(firsttask>secondtask)])   # smallest cumulative value above the mean
    fourthtask = length(firsttask)                            # number of values kept

    output = list(firsttask, data.frame(average=secondtask,
                                        min_over_mean=thirdtask,
                                        size=fourthtask))
    return(output)
}

thingofinterest = c(1:10)
limits = c(5:10)

test = vector("list", length = length(limits))
for(i in 1:length(limits)) {
  test[[i]] = action(thingofinterest, limits, i)
}

test

I want to replace the for loop with a vectorized command, not any of the apply family of functions, as they do not always improve performance. (I'm not suggesting there is anything wrong with for loops; I just need to optimize for speed in this case. See: Is R's apply family more than syntactic sugar?) How would I do this?

Upvotes: 4

Views: 2540

Answers (3)

Tensibai

Reputation: 15784

Just to add a comparison point with the *apply family, I used the code below (results verified with identical(f1(), f2()); f3 returns a different layout).

After testing, the which call gives some speed increase on a large thingofinterest vector.

thingofinterest = c(1:100000)
limits = c(50:1000)

action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  thirdtask = mean(firsttask)
  secondtask = min(firsttask[which(firsttask>thirdtask)])
  fourthtask = length(firsttask)
  list(firsttask, min_over_mean=secondtask,
                  average=thirdtask,
                  size=fourthtask)
}

f1 <- function() {
  test = vector("list", length = length(limits))
  for(i in 1:length(limits)) {
    test[[i]] = action1(thingofinterest, limits, i)
  }
  return(test)
}


action2 <- function(x,y) {
  firsttask = cumsum(x[which(x<y)])
  thirdtask = mean(firsttask)
  secondtask = min(firsttask[which(firsttask>thirdtask)])
  fourthtask = length(firsttask)
  list(firsttask, min_over_mean=secondtask,
                  average=thirdtask,
                  size=fourthtask)
}

f2 <- function() {
  test <- lapply(limits,action2,x=thingofinterest)
  return(test)
}

f3 <- function() {
  test <- sapply(limits,action2,x=thingofinterest)
  return(test)
}

For 1M thingofinterest and 950 limits, here are the results on my machine:

> microbenchmark(f1(),f2(),f3(),times=10)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
 f1() 4.303302 4.336767 4.373119 4.380383 4.403434 4.441945    10
 f2() 4.267922 4.327208 4.450175 4.399422 4.423191 5.041011    10
 f3() 4.240551 4.293855 4.412548 4.362949 4.468117 4.730717    10

So a cleanly done for loop is not that bad in this case.

I have a feeling there's probably a data.table way to do the "action" work in one pass, but it's outside my knowledge area for now.
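For illustration only, here is a rough sketch of what a data.table version might look like, assuming the data.table package is installed (dtl and res are placeholder names). It is essentially the same per-limit loop expressed as a grouped computation rather than a true one-pass solution, and it only returns the summary columns, not the cumsum vectors themselves:

library(data.table)

dtl <- data.table(lim = limits)
res <- dtl[, {
  ft <- cumsum(thingofinterest[thingofinterest < lim])  # cumsum of the values below this limit
  m  <- mean(ft)
  .(average = m, min_over_mean = min(ft[ft > m]), size = length(ft))
}, by = lim]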

On the speed topic more generally, I see no way to truly vectorize this. Since those vectors are not subsets of one another, their cumsums can't be "cut" to avoid recomputing the common sequences.

Since you say in the comments that limits usually has between 90 and 110 entries, parallel processing could be a sensible approach: each iteration is independent of the others, so each can be computed on a different core. (I'm thinking of mclapply, but there may be other tools better suited to your use case.)
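A minimal sketch with parallel::mclapply, reusing action1 from above (mclapply relies on forking, so this only parallelizes on Unix-alikes; mc.cores = 4 is a placeholder to tune for your machine):

library(parallel)

test_par <- mclapply(seq_along(limits),
                     function(i) action1(thingofinterest, limits, i),
                     mc.cores = 4)  # one limit per task, spread over the cores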

Upvotes: 2

Joshua Ulrich

Reputation: 176648

You need to understand where the bottlenecks are in your code before you start trying to change it to make it faster. For example:

timer <- function(action, thingofinterest, limits) {
  st <- system.time({           # for the wall time
    Rprof(interval=0.01)        # Start R's profile timing
    for(j in 1:1000) {          # 1000 function calls
      test = vector("list")
      for(i in 1:length(limits)) {
        test[[i]] = action(thingofinterest, limits, i)
      }
    }
    Rprof(NULL)  # stop the profiler
  })
  # return profiling results
  list(st, head(summaryRprof()$by.total))
}
action = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[which(firsttask>mean(firsttask))])
  fourthtask = length(firsttask)
  output = list(firsttask, data.frame(average=secondtask,
                                      min_over_mean=thirdtask,
                                      size=fourthtask))
  return(output)
}
timer(action, 1:1000, 50:100)
# [[1]]
#    user  system elapsed 
#   9.720   0.012   9.737 
# 
# [[2]]
#                 total.time total.pct self.time self.pct
# "system.time"         9.72    100.00      0.07     0.72
# "timer"               9.72    100.00      0.00     0.00
# "action"              9.65     99.28      0.24     2.47
# "data.frame"          8.53     87.76      0.84     8.64
# "as.data.frame"       5.50     56.58      0.44     4.53
# "force"               4.40     45.27      0.11     1.13

You can see that very little time is spent outside the call to your action function. Now, for is a special primitive and is therefore not captured by the profiler, but the total time reported by the profiler is very similar to the wall time, so there can't be a lot of time missing from the profiler time.

And the thing that takes the most time in your action function is the call to data.frame. Remove that, and you get an enormous speedup.

action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[which(firsttask>mean(firsttask))])
  fourthtask = length(firsttask)
  list(task=firsttask, average=secondtask,
    min_over_mean=thirdtask, size=fourthtask)
}
timer(action1, 1:1000, 50:100)
# [[1]]
#    user  system elapsed 
#   1.020   0.000   1.021 
# 
# [[2]]
#               total.time total.pct self.time self.pct
# "system.time"       1.01    100.00      0.06     5.94
# "timer"             1.01    100.00      0.00     0.00
# "action"            0.95     94.06      0.17    16.83
# "which"             0.57     56.44      0.23    22.77
# "mean"              0.25     24.75      0.13    12.87
# "<"                 0.20     19.80      0.20    19.80

Now you can also get rid of one of the calls to mean and both calls to which.

action2 = function(x,y,i) {
  firsttask = cumsum(x[x < y[i]])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[firsttask > secondtask])
  fourthtask = length(firsttask)
  list(task=firsttask, average=secondtask,
    min_over_mean=thirdtask, size=fourthtask)
}
timer(action2, 1:1000, 50:100)
# [[1]]
#    user  system elapsed 
#   0.808   0.000   0.808 
# 
# [[2]]
#               total.time total.pct self.time self.pct
# "system.time"       0.80    100.00      0.12    15.00
# "timer"             0.80    100.00      0.00     0.00
# "action"            0.68     85.00      0.24    30.00
# "<"                 0.20     25.00      0.20    25.00
# "mean"              0.13     16.25      0.08    10.00
# ">"                 0.05      6.25      0.05     6.25

Now you can see there's a "significant" amount of time spent doing stuff outside your action function. I put significant in quotes because it's 15% of the runtime, but only 120 milliseconds. If your actual code took ~12 hours to run, this new action function would finish in ~1 hour.

The results would be marginally better if I pre-allocated the test list outside of the for loop in the timer function, but the call to data.frame is the biggest time-consumer.
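For reference, that pre-allocation is a one-line tweak inside timer and was not part of the measured runs above; it would mean replacing test = vector("list") with a list of the known length:

test <- vector("list", length(limits))   # pre-allocate instead of growing the list
for (i in seq_along(limits)) {
  test[[i]] <- action(thingofinterest, limits, i)
}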

Upvotes: 5

talat

Reputation: 70266

Here's a little comparison in reference to my comment above. I've made the changes suggested there (initialize test with the right length, change the order of operations in action, and, if you can accept that, remove the data.frame call from the list output of action):

library(microbenchmark)
microbenchmark(f0(), f1())

Unit: microseconds
 expr       min        lq      mean     median        uq       max neval
 f0() 14042.192 14730.036 16091.508 15168.3175 16993.631 28193.770   100
 f1()   894.555   928.738  1094.448   985.2865  1190.252  4710.675   100

These modifications brought a roughly 15x speed-up.

Functions and data for comparison:

action0 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  secondtask = min(firsttask[which(firsttask>mean(firsttask))])
  thirdtask = mean(firsttask)
  fourthtask = length(firsttask)
  output = list(firsttask, data.frame(min_over_mean=secondtask,
                                      average=thirdtask,
                                      size=fourthtask))
  return(output)
}

f0 <- function() {
  test = vector("list")
  for(i in 1:length(limits)) {
    test[[i]] = action0(thingofinterest, limits, i)
  }
}

thingofinterest = c(1:1000)
limits = c(50:100)

action1 = function(x,y,i) {
  firsttask = cumsum(x[which(x<y[i])])
  thirdtask = mean(firsttask)
  secondtask = min(firsttask[which(firsttask>thirdtask)])
  fourthtask = length(firsttask)
  list(firsttask, min_over_mean=secondtask,
                  average=thirdtask,
                  size=fourthtask)
}

f1 <- function() {
  test = vector("list", length = length(limits))
  for(i in 1:length(limits)) {
    test[[i]] = action1(thingofinterest, limits, i)
  }
}

Upvotes: 3
