whirlaway

Reputation: 199

How to get rid of this for loop

I am told that there is no need to have a "for" loop in R at all. So, I want to see how I can get rid of this Python-like "for" loop in my R code:

  diff.vec = c()   # vector of differences
  numrows = nrow(yrdf)   # yrdf is a data frame
  for (index in 1:nrow(yrdf)) {
    if (index == numrows) {
      diff = NA  # because there is no entry "below" it
    } else {
      val_index = yrdf$Adj.Close[index] 
      val_next = yrdf$Adj.Close[index+1]
      diff = val_index - val_next   #  diff between two adjacent values
      diff = diff/yrdf$Adj.Close[index+1] * 100.0
    }
    diff.vec<-c(diff.vec,diff) # append to vector of differences
  }

Upvotes: 0

Views: 182

Answers (3)

Mark Peterson

Reputation: 9570

In my experience, there are three reasons to avoid a for loop. The first is that loops can be difficult for others to read (if you share your code), and the apply family of functions can improve that (and is more explicit about what is returned). The second is a speed advantage that is possible in some circumstances, particularly if you move to running the code in parallel (e.g., most apply functions are embarrassingly parallel, while for loops take much more work to break apart).
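
For example, an apply-style version of the question's loop might look something like this (just a sketch, using sapply() from base R on the same yrdf data frame from the question):

# Sketch: the loop body wrapped in sapply(), which builds the result vector for you
sapply(seq_len(nrow(yrdf)), function(index) {
  if (index == nrow(yrdf)) {
    NA  # last entry has nothing below it
  } else {
    (yrdf$Adj.Close[index] - yrdf$Adj.Close[index + 1]) /
      yrdf$Adj.Close[index + 1] * 100
  }
})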

However, it is the third reason that serves you here: vectorized solutions are often better than any of the above because they avoid repeated function calls (e.g., the c() at the end of your loop, the if check on every iteration, etc.). Here, you can accomplish everything with a single vectorized expression.

First, some sample data

set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(5))

Then we take the diff of adjacent entries in Adj.Close, use vectorized division to divide each difference by the following entry, and multiply by 100. Note that the NA padding is only needed if you want the result to be the same length as the input; if you don't need that NA at the end of the vector, it can be even simpler (see the sketch after the output below).

100 * c(diff(yrdf$Adj.Close),NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)

Returns

[1] 238.06442 216.94975 130.41349 -90.47879        NA
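
If the trailing NA is not needed, the expression can be even shorter, for example:

100 * diff(yrdf$Adj.Close) / yrdf$Adj.Close[-1]

(Keep in mind that diff() returns each entry minus the one before it, which is the negative of the val_index - val_next difference in the question's loop; put a minus sign in front of diff() if you want to match the loop's sign exactly.)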

And, to be explicit, here is the microbenchmark comparison:

myForLoop <- function(){
  numrows = nrow(yrdf)
  diff.vec = c()   # vector of differences
  for (index in 1:nrow(yrdf)) {   # yrdf is a data frame 
    if (index == numrows) {
      diff = NA  # because there is no entry "below" it
    } else {
      val_index = yrdf$Adj.Close[index] 
      val_next = yrdf$Adj.Close[index+1]
      diff = val_index - val_next   #  diff between two adjacent values
      diff = diff/yrdf$Adj.Close[index+1] * 100.0
    }
    diff.vec<-c(diff.vec,diff) # append to vector of differences
  }
  return(diff.vec)
}

microbenchmark::microbenchmark(
  forLoop = myForLoop()
  , vector = 100 * c(diff(yrdf$Adj.Close),NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
)

gives:

Unit: microseconds
    expr    min     lq     mean median      uq     max neval
 forLoop 74.238 78.184 82.06786 81.287 84.3740 104.190   100
  vector 20.193 21.718 23.91824 22.716 24.0535  80.754   100

Note that the vector approach takes about 30% of the time of the for loop. This gets more important as the size of the data increases:

set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(10000))

microbenchmark::microbenchmark(
  forLoop = myForLoop()
  , vector = 100 * c(diff(yrdf$Adj.Close),NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
)

gives

Unit: microseconds
    expr        min         lq        mean     median          uq        max neval
 forLoop 306883.977 315116.446 351183.7997 325211.743 361479.6835 545383.457   100
  vector    176.704    194.948    326.6135    219.512    236.9685   4989.051   100

Note that the difference in how these scale is massive: the vector version takes less than 0.1% of the time to run. This is likely because each call to c() to append a new entry copies the entire vector built so far. A slight change, pre-allocating the result vector instead of growing it, speeds the for loop up, though it still does not reach the vectorized speed:

myForLoopAlt <- function(){
  numrows = nrow(yrdf)
  diff.vec = numeric(numrows)   # pre-allocate the vector of differences
  for (index in 1:nrow(yrdf)) {   # yrdf is a data frame 
    if (index == numrows) {
      diff = NA  # because there is no entry "below" it
    } else {
      val_index = yrdf$Adj.Close[index] 
      val_next = yrdf$Adj.Close[index+1]
      diff = val_index - val_next   #  diff between two adjacent values
      diff = diff/yrdf$Adj.Close[index+1] * 100.0
    }
    diff.vec[index] <- diff # fill the pre-allocated slot instead of appending
  }
  return(diff.vec)
}



microbenchmark::microbenchmark(
  forLoop = myForLoop()
  , newLoop = myForLoopAlt()
  , vector = 100 * c(diff(yrdf$Adj.Close),NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
)

gives

Unit: microseconds
    expr        min         lq        mean      median          uq        max neval
 forLoop 304751.250 315433.802 354605.5850 325944.9075 368584.2065 528732.259   100
 newLoop 168014.142 179579.984 186882.7679 181843.7465 188654.5325 318431.949   100
  vector    169.569    208.193    331.2579    219.9125    233.3115   2956.646   100

That cuts the for loop's time roughly in half, but it is still far slower than the vectorized solution.

Upvotes: 1

troh

Reputation: 1364

You can also use the lead function from the dplyr package to get the result that you want.

library(dplyr)
yrdf <- data.frame(Adj.Close = rnorm(100))
(yrdf$Adj.Close/lead(yrdf$Adj.Close)-1)*100

The calculation has been simplified from (a - b)/b to a/b - 1, and the whole thing is a single vectorized operation rather than a for loop.
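
If the data lives in a dplyr pipeline anyway, the same expression could also go inside mutate() to store the result as a new column, for example (the column name pct_change is just a placeholder):

library(dplyr)
yrdf <- data.frame(Adj.Close = rnorm(100))
# Store the percentage change relative to the next row as a new column
yrdf <- yrdf %>%
  mutate(pct_change = (Adj.Close / lead(Adj.Close) - 1) * 100)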

Upvotes: 0

Mikhail

Reputation: 153

yrdf <- data.frame(Adj.Close = rnorm(100))   # example data
numrows <- length(yrdf$Adj.Close)
# Divide each value by the next one, convert to a percentage change,
# and pad with NA so the result has the same length as the input
diff.vec <- c((yrdf$Adj.Close[1:(numrows-1)] / yrdf$Adj.Close[2:numrows] - 1) * 100, NA)

Upvotes: 0
