Reputation: 125
This question is mostly about learning good R programming practice. I'd like to call replicate
repeatedly, varying a single input to the expression inside it. I can easily do this with a for
loop, but I've heard repeatedly that if I'm using for loops in R, I'm doing it wrong. Is there a way to repeat a call to replicate
with different inputs without a loop? My working loop version is below, followed by my best attempt so far without one.
Working Code with Loop:
set.seed(1564) #Birth of Galileo!
x <- rnorm(1000, 15, 3)
y <- 2*x + rnorm(1000, 0, 5)
cor(x, y)
cor.fxn <- function(N, x, y) {
  samp.row <- sample(1:1000, N)
  cor(x[samp.row], y[samp.row])
}
N.list <- seq(3,20)
cor.list <- rep(NA_real_, length(N.list))
for (N in N.list){
  cor.resamp <- replicate(1000, cor.fxn(N, x, y))
  cor.list[N-2] <- median(cor.resamp)
}
plot(N.list, cor.list)
Nonfunctional best attempt without loop:
set.seed(1564) #Birth of Galileo!
x <- rnorm(1000, 15, 3)
y <- 2*x + rnorm(1000, 0, 5)
X <- list(3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
eggs <- lapply(X, replicate, n=1000, expr=cor.fxn, x=x, y=y)
Which will error out:
Error in FUN(X[[i]], ...) :
unused arguments (x = c(9.17486389116665, 13.6573453081421, 12.2166561575586, 11.3619489970582, 17.9998611075272, 11.1171958860255, 20.4489048239365, 16.8825343591062, 12.9990097472942, 12.5617129892976, 10.9833420846924, 13.7732692244654, 16.9641205588413, 11.1309409503371, 11.7859737745279,...
Thank you for any assistance.
Upvotes: 0
Views: 1762
Reputation: 3188
Looping is slow in R, but the part you probably didn't hear is that the fix is to vectorize your operations, not to swap the loop for an *apply call. *apply family functions are not inherently faster than for loops. Let's look at some benchmarks:
# Boiler plate code used for both functions
cor.fxn <- function(N, x, y) {
  samp.row <- sample(1:1000, N)
  cor(x[samp.row], y[samp.row])
}
set.seed(1564) #Birth of Galileo!
x <- rnorm(1000, 15, 3)
y <- 2*x + rnorm(1000, 0, 5)
N.list <- seq(3,20)
# Using 'for loop'
foo_a = function(...) {
  cor.list <- rep(NA_real_, length(N.list))
  for (N in N.list) {
    cor.resamp <- replicate(1000, cor.fxn(N, x, y))
    cor.list[N-2] <- median(cor.resamp)
  }
  cor.list  # return the medians, as foo_b does
}
# Using sapply
foo_b = function(...) sapply(3:20, function(n) median(replicate(1000, cor.fxn(n, x, y))))
library(microbenchmark)  # provides microbenchmark()
microbenchmark(foo_a(), foo_b(), times = 100L)
Looks like there is no difference in timing between the two methods, as claimed above.
Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval
 foo_a() 939.7068 1041.964 1140.159 1146.065 1243.540 1367.411   100
 foo_b() 936.5962 1045.023 1138.337 1133.074 1239.099 1334.430   100
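As a side note on why the sapply call in the benchmark works while the lapply attempt in the question errors out: replicate() only has the arguments n, expr and simplify. In lapply(X, replicate, n=1000, expr=cor.fxn, x=x, y=y), n and expr are matched by name, each element of X lands in simplify, and x and y match nothing, hence "unused arguments". Wrapping replicate() in an anonymous function sends each N to cor.fxn instead. A minimal sketch of that pattern follows; using vapply rather than sapply and dropping to 200 replicates are my own choices to keep the example quick and type-checked, not something the question requires.

```r
# Same setup as in the question
cor.fxn <- function(N, x, y) {
  samp.row <- sample(1:1000, N)
  cor(x[samp.row], y[samp.row])
}
set.seed(1564)
x <- rnorm(1000, 15, 3)
y <- 2*x + rnorm(1000, 0, 5)
N.list <- 3:20

# The anonymous function receives one N at a time and calls replicate()
# with an expression that uses it. 200 replicates is an arbitrary value
# chosen for speed (the question uses 1000).
cor.list <- vapply(N.list,
                   function(N) median(replicate(200, cor.fxn(N, x, y))),
                   numeric(1))
```

cor.list can then be passed to plot(N.list, cor.list) exactly as in the loop version.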
This specific test case can't be vectorized since you are taking the median of 1000 runs of a process. The whole point of this post is that for loops are not inherently worse than *apply family functions in R. However, you should always prefer a vectorized solution over a looping/apply solution when possible.
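To make "vectorized" concrete, here is a small sketch on a different problem from the one in the question: a sum of squared deviations computed one element at a time versus in a single whole-vector expression. The names v, ss_loop and ss_vec are made up for this illustration.

```r
set.seed(1)
v <- rnorm(1e5)

# Loop version: every iteration runs in interpreted R code
ss_loop <- function(v) {
  m <- mean(v)
  total <- 0
  for (i in seq_along(v)) {
    total <- total + (v[i] - m)^2
  }
  total
}

# Vectorized version: the subtraction, squaring and sum() each operate
# on the whole vector at once in compiled code
ss_vec <- function(v) sum((v - mean(v))^2)

# Check that both give the same answer up to floating-point tolerance
same <- all.equal(ss_loop(v), ss_vec(v))
```

Timing the two with microbenchmark(ss_loop(v), ss_vec(v)) would be the natural follow-up. The results depend on the machine, so none are quoted here.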
Upvotes: 1