tstev

Reputation: 616

Suggestions for avoiding for-loops in R

I am trying to avoid using for() loops for my problem. Let's say, for simplicity's sake, I have two vectors: x1 <- c(1,10,30) and x2 <- c(11,31,40). These vectors contain reference points that mark intervals in my df, whose variables each have, in this case, 40 observations. So:

df[x1[1]:x2[1],] would be roughly the first ten observations, df[x1[2]:x2[2],] the next twenty, and the last pair (30,40) the final ten. I want to calculate multiple statistics, for example the mean, standard deviation, and variance, for each of these intervals. A for() loop will do the trick, but it is very slow. I was looking at the apply functions, but I can't seem to figure it out. mean(df[x1:x2]) also doesn't do the trick, as it just takes the first values of x1 and x2.
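For reference, a minimal sketch of the kind of loop I am trying to avoid (assuming df has a single numeric column):

stats <- matrix(NA_real_, nrow = length(x1), ncol = 3,
                dimnames = list(NULL, c("mean", "sd", "var")))
for (i in seq_along(x1)) {
    chunk <- df[x1[i]:x2[i], 1]   # observations in the i-th interval
    stats[i, ] <- c(mean(chunk), sd(chunk), var(chunk))
}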

Any suggestions?

--tstev

Upvotes: 1

Views: 116

Answers (4)

r2evans

Reputation: 160407

I tend to be averse to using apply on the rows of a data.frame, since apply coerces the frame to a matrix and any mis-step up-converts everything to the character class. I've had to do something very similar to what you are asking in other code, and I opted for mapply.
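A quick illustration of that coercion, on a throwaway frame: one character column is enough to turn every value into character.

d <- data.frame(num = 1:3, chr = c("a", "b", "c"))
apply(d, 1, function(row) class(row["num"]))
## [1] "character" "character" "character"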

mapply does "something" with the first elements of two (or more) vectors/lists, then the same "something" with their second elements, and so on. The "something", of course, is defined by the first argument -- a function, just as with the other *apply functions.

set.seed(42)
x1 <- c(1,10,30)
x2 <- c(11,31,40)
df <- as.data.frame(sample(40))   # one numeric column, 40 rows
ret <- mapply(function(a,b) df[a:b,], x1, x2)   # one vector of rows per (a,b) pair
ret
## [[1]]
##  [1] 37 40 11 31 24 19 26  5 22 32 14
## [[2]]
##  [1] 32 14 21 27  7 13 36 25  3 38 12 35 23 18 17  2  8  6 29 30 10 15
## [[3]]
##  [1] 10 15 39  4 33  1 28 34  9 16 20

From here it would be trivial to apply any other statistical summaries you want:

sapply(ret, function(x) c(mean=mean(x), sd=sd(x)))
##          [,1]     [,2]     [,3]
## mean 23.72727 19.13636 19.00000
## sd   10.95528 11.14107 12.87633

(Or you could always extend the mapply call to directly call these other functions.)
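For instance (this is the same idea the benchmark code below uses):

mapply(function(a, b) {
    x <- df[a:b, ]
    c(mean = mean(x), sd = sd(x), var = var(x))
}, x1, x2)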

EDIT #1:

As suggested by @docendo discimus, Map (and mapply with SIMPLIFY=FALSE) are slightly faster. For comparison:

set.seed(3)
x1 <- c(1,11,31)
x2 <- c(10,30,40)
df1 <- data.frame(V1 = sample(40))
df2 <- df1[,,drop = FALSE]   # explicit deep copy (setDT below modifies df1 in place)
df3 <- df1[,,drop = FALSE]   # another copy for the mapply/Map variants
grp <- rep(seq_along(x1), (x2-x1) + 1L)   # group id for each row
df2 <- cbind(df2, grp)

library(data.table)
library(dplyr)
library(microbenchmark)

microbenchmark(dt=setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
               dplyr=df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
               mapplyT=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
               mapplyF=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
               Map=Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: microseconds
##     expr      min        lq      mean    median        uq      max neval
##       dt  925.964 1006.9570 1176.5629 1081.4810 1184.7870 2582.434   100
##    dplyr 1843.449 1967.0590 2154.9829 2042.2515 2185.2745 3839.960   100
##  mapplyT  208.398  237.8500  272.8850  260.8315  286.2685  511.846   100
##  mapplyF  187.424  208.6205  237.6805  225.1320  247.2215  445.801   100
##      Map  191.441  215.7610  240.9025  231.6025  258.6005  441.785   100

I made explicit deep copies of the data.frame because setDT modifies the data.frame in place (hence its efficiency), while mapply and Map were not able to cope with the resulting data.table. (I baked mean, sd, and var into my mapply calls in order to compare apples with apples.)
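As a side note, the in-place behavior of setDT is easy to see on a throwaway frame:

tmp <- data.frame(V1 = 1:3)
setDT(tmp)    # no assignment needed; tmp is converted by reference
class(tmp)
## [1] "data.table" "data.frame"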

EDIT #2:

The previous benchmarks look impressive and conclusive, but they don't distinguish the overhead of many small calls from the efficiency of the large-data engines. Here's another run at things with more data.

When the individual subsets are fairly large -- i.e., fewer "chunks" from the source data.frame -- performance tends to balance out. Here I control the chunk size with k:

n <- 4000
k <- 100                                     # target chunk size: n/k chunks
x1 <- c(1, sort(sample(n, size = n/k - 1)))  # random interval starts
x2 <- c(x1[-1] - 1, n)                       # matching interval ends
df1 <- data.frame(V1 = sample(n))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)

microbenchmark(dt=setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
               dplyr=df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
               mapplyT=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
               mapplyF=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
               Map=Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: milliseconds
##     expr      min       lq     mean   median       uq      max neval
##       dt 2.133063 2.297282 2.549046 2.435618 2.655842 4.305396   100
##    dplyr 2.145558 2.401482 2.643981 2.552090 2.720102 4.374118   100
##  mapplyT 2.599392 2.775883 3.135473 2.926045 3.156978 5.430832   100
##  mapplyF 2.498540 2.738398 3.079050 2.882535 3.094057 7.041340   100
##      Map 2.624382 2.725680 3.158272 2.894808 3.184869 6.533956   100

However, if the chunk size is reduced (so there are many more chunks), the already-well-performing dplyr comes out ahead by a good margin:

n <- 4000
k <- 10                                      # smaller chunks, so many more groups
x1 <- c(1, sort(sample(n, size = n/k - 1)))
x2 <- c(x1[-1] - 1, n)
df1 <- data.frame(V1 = sample(n))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)

microbenchmark(dt=setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
               dplyr=df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
               mapplyT=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
               mapplyF=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
               Map=Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: milliseconds
##     expr       min       lq      mean    median        uq       max neval
##       dt 11.494443 12.45187 14.163123 13.716532 14.655883 62.424668   100
##    dplyr  2.729696  3.05501  3.286876  3.148276  3.324098  4.832414   100
##  mapplyT 25.195579 27.67426 28.488846 28.319758 29.247729 32.897811   100
##  mapplyF 25.455742 27.42816 28.713237 28.038622 28.958785 76.587224   100
##      Map 25.184870 27.32730 28.737281 28.198155 28.768237 77.830470   100

Notice that dplyr took roughly the same time with 400 chunks as it did with 40. Nice.

There are three kinds of lies: lies, damned lies, and statistics. (Benjamin Disraeli) This applies equally well to benchmarks.

Upvotes: 2

Sergei

Reputation: 280

Maybe instead of a for loop you could use apply twice? The desired computation can be wrapped into a function (in my example, compute_mean), and that function can then be called on pairs of indexes from x1 and x2. Given that x1 and x2 are of the same length, this is easy to do with lapply:

x1 <- c(1,10,30)
x2 <- c(10,30,40)
df <- as.data.frame(sample(40))

compute_mean <- function(df, ind1, ind2, i){
    # column-wise mean over the rows of the i-th interval
    result <- apply(df[ind1[i]:ind2[i], , drop = FALSE], 2, mean)
    return(result)
}

unlist(lapply(seq_along(x1), function(x){
    out <- compute_mean(df = df, ind1 = x1, ind2 = x2, i = x)
    return(out)
}))

Upvotes: 1

Colonel Beauvel

Reputation: 31161

A good opportunity to use Map with the useful each() from the plyr package:

library(plyr)

Map(function(u,v) each(mean, sd, var)(df[u:v,1]), x1, x2)

#[[1]]
#    mean        sd       var
#17.90000  10.15929 103.21111  

#[[2]]
#    mean        sd       var
#19.14286  12.18313 148.42857

#[[3]]
#    mean        sd       var 
#24.81818  10.78720 116.36364

Data:

x1 <- c(1,10,30)
x2 <- c(10,30,40)
set.seed(3)
df <- as.data.frame(sample(40))

Upvotes: 1

Benoit

Reputation: 1234

Here is a solution to your problem:

x1 <- c(1,10,30)
x2 <- c(10,30,40)

df <- as.data.frame(sample(40))
df2 <- data.frame(x1,x2)   # one row per interval: start and end index

apply(df2, 1, function(x) mean(df[x[1]:x[2],]))

Just replace mean() with sd() or var() to get the standard deviation or variance. Don't forget the na.rm=TRUE argument if you have missing data in df.
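For instance, with the same df and df2 as above:

apply(df2, 1, function(x) sd(df[x[1]:x[2],], na.rm = TRUE))
apply(df2, 1, function(x) var(df[x[1]:x[2],], na.rm = TRUE))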

Upvotes: 1
