Reputation: 199
I am told that there is no need to have a "for" loop in R at all. So, I want to see how I can get rid of this Python-like "for" loop in my R code:
numrows = nrow(yrdf) # number of rows
diff.vec = c() # vector of differences
for (index in 1:numrows) { # yrdf is a data frame
if (index == numrows) {
diff = NA # because there is no entry "below" it
} else {
val_index = yrdf$Adj.Close[index]
val_next = yrdf$Adj.Close[index+1]
diff = val_index - val_next # diff between two adjacent values
diff = diff/yrdf$Adj.Close[index+1] * 100.0
}
diff.vec <- c(diff.vec, diff) # append to vector of differences
}
Upvotes: 0
Views: 182
Reputation: 9570
In my experience, there are three reasons to avoid a for loop. The first is that loops can be difficult for others to read (if you share your code), and the apply family of functions can improve that (and be more explicit about what is returned). The second is a speed advantage available in some circumstances, particularly if you move to run the code in parallel (e.g., most apply functions are embarrassingly parallel, while for loops take much more work to break apart).
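To illustrate the apply family, here is a hypothetical reworking of the loop in the question using sapply (toy data invented for the example; not code from the question or answers):

```r
# Sketch: the question's loop rewritten with sapply(); toy data for illustration
yrdf <- data.frame(Adj.Close = c(10, 8, 5, 4))
n <- nrow(yrdf)
diff.vec <- sapply(seq_len(n), function(index) {
  if (index == n) return(NA_real_)  # no entry "below" the last row
  (yrdf$Adj.Close[index] - yrdf$Adj.Close[index + 1]) /
    yrdf$Adj.Close[index + 1] * 100  # percent change vs the next value
})
```

This reads more declaratively than the loop and avoids growing a vector with c(), though it still makes one function call per element.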
However, it is the third reason that serves you here: vectorized solutions are often better than any of the above because they avoid repeated calls (e.g., your c at the end of each loop iteration, the if check, etc.). Here, you can accomplish everything with a single vectorized call.
First, some sample data:
set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(5))
Then we take the diff of the adjacent entries in Adj.Close, negate it (your loop subtracts the next value from the current one, while diff does the opposite), use vectorized division to divide by the following entry, and multiply by 100. Note that we need to pad with NA if (and only if) you want the outcome to be the same length as the input. If you don't want/need that NA at the end of the vector, it can be even easier.
100 * c(-diff(yrdf$Adj.Close), NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
Returns
[1] -238.06442 -216.94975 -130.41349   90.47879         NA
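For completeness, a sketch of the shorter form without the trailing NA (the minus sign keeps the loop's current-minus-next ordering; yrdf as in the sample data above):

```r
set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(5))
x <- yrdf$Adj.Close
# Result has length n - 1: no NA padding needed if the output
# may be shorter than the input
pct <- -100 * diff(x) / x[-1]
```

This drops both c() calls, since nothing needs to be padded to full length.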
And, to be explicit, here is the microbenchmark
comparison:
myForLoop <- function(){
numrows = nrow(yrdf)
diff.vec = c() # vector of differences
for (index in 1:nrow(yrdf)) { # yrdf is a data frame
if (index == numrows) {
diff = NA # because there is no entry "below" it
} else {
val_index = yrdf$Adj.Close[index]
val_next = yrdf$Adj.Close[index+1]
diff = val_index - val_next # diff between two adjacent values
diff = diff/yrdf$Adj.Close[index+1] * 100.0
}
diff.vec<-c(diff.vec,diff) # append to vector of differences
}
return(diff.vec)
}
microbenchmark::microbenchmark(
forLoop = myForLoop()
, vector = 100 * c(-diff(yrdf$Adj.Close), NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
)
gives:
Unit: microseconds
expr min lq mean median uq max neval
forLoop 74.238 78.184 82.06786 81.287 84.3740 104.190 100
vector 20.193 21.718 23.91824 22.716 24.0535 80.754 100
Note that the vector approach takes about 30% of the time of the for loop. This gets more important as the size of the data increases:
set.seed(8675309)
yrdf <- data.frame(Adj.Close = rnorm(10000))
microbenchmark::microbenchmark(
forLoop = myForLoop()
, vector = 100 * c(-diff(yrdf$Adj.Close), NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
)
gives
Unit: microseconds
expr min lq mean median uq max neval
forLoop 306883.977 315116.446 351183.7997 325211.743 361479.6835 545383.457 100
vector 176.704 194.948 326.6135 219.512 236.9685 4989.051 100
Note that the difference in how these scale is massive: the vector version takes less than 0.1% of the time to run. This is likely because each call to c to add a new entry requires copying the full vector. A slight change, preallocating the result vector, speeds the for loop up a bit, but does not get it all the way to the vectorized speed:
myForLoopAlt <- function(){
numrows = nrow(yrdf)
diff.vec = numeric(numrows) # preallocated vector of differences
for (index in 1:nrow(yrdf)) { # yrdf is a data frame
if (index == numrows) {
diff = NA # because there is no entry "below" it
} else {
val_index = yrdf$Adj.Close[index]
val_next = yrdf$Adj.Close[index+1]
diff = val_index - val_next # diff between two adjacent values
diff = diff/yrdf$Adj.Close[index+1] * 100.0
}
diff.vec[index] <- diff # assign into the preallocated vector
}
return(diff.vec)
}
microbenchmark::microbenchmark(
forLoop = myForLoop()
, newLoop = myForLoopAlt()
, vector = 100 * c(-diff(yrdf$Adj.Close), NA) / c(yrdf$Adj.Close[2:nrow(yrdf)], NA)
)
gives
Unit: microseconds
expr min lq mean median uq max neval
forLoop 304751.250 315433.802 354605.5850 325944.9075 368584.2065 528732.259 100
newLoop 168014.142 179579.984 186882.7679 181843.7465 188654.5325 318431.949 100
vector 169.569 208.193 331.2579 219.9125 233.3115 2956.646 100
That shaved about half of the time off the for loop approach, but it is still far slower than the vectorized solution.
Upvotes: 1
Reputation: 1364
You can also use the lead function from the dplyr package to get the result that you want.
library(dplyr)
yrdf <- data.frame(Adj.Close = rnorm(100))
(yrdf$Adj.Close/lead(yrdf$Adj.Close)-1)*100
The calculation has been simplified from (a - b)/b to a/b - 1, and lead shifts the vector forward so the division is fully vectorized instead of a for loop; because lead pads with NA at the end, the last element is NA, just as in the loop.
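A quick base-R check (not part of the original answer) that the two formulas agree; here b mimics dplyr::lead(a) by shifting the vector and padding with NA:

```r
a <- c(10, 8, 5, 4)          # toy values
b <- c(a[-1], NA)            # base-R equivalent of dplyr::lead(a)
v1 <- (a - b) / b * 100      # (a - b)/b form, as in the question's loop
v2 <- (a / b - 1) * 100      # simplified a/b - 1 form
```

Both forms produce identical vectors, including the NA in the final position.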
Upvotes: 0
Reputation: 153
The same result in base R, using vector indexing to divide each value by the one after it:
yrdf <- data.frame(Adj.Close = rnorm(100))
numrows <- length(yrdf$Adj.Close)
diff.vec <- c((yrdf$Adj.Close[1:(numrows - 1)] / yrdf$Adj.Close[2:numrows] - 1) * 100, NA)
Upvotes: 0