Reputation: 616
I am trying to avoid using for() loops for my problem. Let's say I have two vectors, for simplicity's sake: x1 <- c(1,10,30) and x2 <- c(11,31,40). These vectors contain reference points marking intervals in my df, whose variables each have, in this case, 40 observations. So:
df[x1[1]:x2[1], ] would be roughly the first ten observations, df[x1[2]:x2[2], ] roughly the next twenty, and the last pair (30, 40) the final ten or so. I want to calculate multiple statistics, including the mean, standard deviation and variance, for each of these intervals. for() loops will do the trick, but they are very slow. I was looking at the apply functions, but I can't seem to figure it out. mean(df[x1:x2]) also doesn't do the trick, as it just takes the first values of x1 and x2.
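For illustration, the colon operator only uses the first element of each vector, so x1:x2 just gives 1:11 with a warning:
x1:x2
## [1]  1  2  3  4  5  6  7  8  9 10 11
## plus warnings: numerical expression has 3 elements: only the first used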
Any suggestions?
--tstev
Upvotes: 1
Views: 116
Reputation: 160407
I tend to be averse to using apply on the rows of a data.frame (as any mis-step up-converts everything to character class). I've had to do something very similar to what you are asking in other code, and I opted for mapply.
It does "something" with the first elements of two (or more) vectors/lists, then does the same "something" with the second elements of the same vectors/lists, etc. The "something" is, of course, defined by the first argument -- a function, similar to the other *apply functions.
set.seed(42)
x1 <- c(1,10,30)
x2 <- c(11,31,40)
df <- as.data.frame(sample(40))
ret <- mapply(function(a,b) df[a:b,], x1, x2)
ret
## [[1]]
## [1] 37 40 11 31 24 19 26 5 22 32 14
## [[2]]
## [1] 32 14 21 27 7 13 36 25 3 38 12 35 23 18 17 2 8 6 29 30 10 15
## [[3]]
## [1] 10 15 39 4 33 1 28 34 9 16 20
From here it would be trivial to apply any other statistical summaries you want:
sapply(ret, function(x) c(mean=mean(x), sd=sd(x)))
## [,1] [,2] [,3]
## mean 23.72727 19.13636 19.00000
## sd 10.95528 11.14107 12.87633
(Or you could always extend the mapply call to call these other functions directly.)
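A minimal sketch of that extension, reusing df, x1 and x2 from above (it is the same pattern as in the benchmarks below); the summaries come back as a 3x3 matrix, one column per interval:
stats <- mapply(function(a, b) {
  x <- df[a:b, ]                        # one interval of the single column
  c(mean = mean(x), sd = sd(x), var = var(x))
}, x1, x2)
stats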
EDIT #1:
As suggested by @docendo discimus, Map (and mapply with SIMPLIFY=FALSE) are slightly faster. For comparison:
set.seed(3)
x1 <- c(1,11,31)
x2 <- c(10,30,40)
df1 <- data.frame(V1 = sample(40))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)
library(data.table)
library(dplyr)
library(microbenchmark)
microbenchmark(
    dt      = setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
    dplyr   = df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
    mapplyT = mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
    mapplyF = mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
    Map     = Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: microseconds
## expr min lq mean median uq max neval
## dt 925.964 1006.9570 1176.5629 1081.4810 1184.7870 2582.434 100
## dplyr 1843.449 1967.0590 2154.9829 2042.2515 2185.2745 3839.960 100
## mapplyT 208.398 237.8500 272.8850 260.8315 286.2685 511.846 100
## mapplyF 187.424 208.6205 237.6805 225.1320 247.2215 445.801 100
## Map 191.441 215.7610 240.9025 231.6025 258.6005 441.785 100
I made explicit deep copies of the data.frame because setDT modifies the data.frame in place (ergo its efficiency), but mapply and Map were not able to cope with the data.table. (I baked mean, sd and var into my mapply calls in order to compare apples with apples.)
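A quick illustration of that in-place behavior, using a throwaway frame:
tmp <- data.frame(V1 = 1:5)
setDT(tmp)   # converts tmp to a data.table by reference; no copy is made
class(tmp)
## [1] "data.table" "data.frame"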
EDIT #2:
The previous benchmarks look impressive and conclusive, but they don't depict the overhead of repeated function calls versus the efficiency of large-data engines. Here's another run at things with more data.
When the individual subsets are fairly large -- i.e., fewer "chunks" from the source data.frame -- performance tends to balance out. Here I control chunk size with k:
n <- 4000
k <- 100
x1 <- c(1, sort(sample(n, size = n/k - 1)))
x2 <- c(x1[-1] - 1, n)
df1 <- data.frame(V1 = sample(n))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)
microbenchmark(
    dt      = setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
    dplyr   = df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
    mapplyT = mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
    mapplyF = mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
    Map     = Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: milliseconds
## expr min lq mean median uq max neval
## dt 2.133063 2.297282 2.549046 2.435618 2.655842 4.305396 100
## dplyr 2.145558 2.401482 2.643981 2.552090 2.720102 4.374118 100
## mapplyT 2.599392 2.775883 3.135473 2.926045 3.156978 5.430832 100
## mapplyF 2.498540 2.738398 3.079050 2.882535 3.094057 7.041340 100
## Map 2.624382 2.725680 3.158272 2.894808 3.184869 6.533956 100
However, if the chunk size is reduced, the already-well-performing dplyr comes out ahead by a good margin:
n <- 4000
k <- 10
x1 <- c(1, sort(sample(n, size = n/k - 1)))
x2 <- c(x1[-1] - 1, n)
df1 <- data.frame(V1 = sample(n))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)
microbenchmark(
    dt      = setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
    dplyr   = df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
    mapplyT = mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
    mapplyF = mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
    Map     = Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: milliseconds
## expr min lq mean median uq max neval
## dt 11.494443 12.45187 14.163123 13.716532 14.655883 62.424668 100
## dplyr 2.729696 3.05501 3.286876 3.148276 3.324098 4.832414 100
## mapplyT 25.195579 27.67426 28.488846 28.319758 29.247729 32.897811 100
## mapplyF 25.455742 27.42816 28.713237 28.038622 28.958785 76.587224 100
## Map 25.184870 27.32730 28.737281 28.198155 28.768237 77.830470 100
Notice that dplyr took roughly the same time on the smaller dataset as on the larger one. Nice.
There are three kinds of lies: lies, damned lies, and statistics. (Benjamin Disraeli) This applies equally well to benchmarks.
Upvotes: 2
Reputation: 280
Maybe instead of a for loop you could use apply twice? The desired computation can be wrapped in a function (in my example it is compute_mean), and then one can call this function on pairs of indexes from x1 and x2. Given that x1 and x2 are of the same length, it is easy to do with lapply:
x1 <- c(1,10,30)
x2 <- c(10,30,40)
df <- as.data.frame(sample(40))
compute_mean <- function(df, ind1, ind2, i){
  # column-wise mean over rows ind1[i]:ind2[i]
  result <- apply(df[ind1[i]:ind2[i], , drop = FALSE], 2, mean)
  return(result)
}
unlist(lapply(seq_along(x1), function(x){
  out <- compute_mean(df = df, ind1 = x1, ind2 = x2, i = x)
  return(out)
}))
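The same pattern extends to several statistics at once. A sketch, where compute_stats is an illustrative name rather than part of the original answer:
compute_stats <- function(df, ind1, ind2, i){
  x <- df[ind1[i]:ind2[i], , drop = FALSE][[1]]   # extract the single column
  c(mean = mean(x), sd = sd(x), var = var(x))
}
sapply(seq_along(x1), function(i) compute_stats(df, x1, x2, i))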
Upvotes: 1
Reputation: 31161
A good opportunity to use Map with the useful each from the plyr package:
library(plyr)
Map(function(u,v) each(mean, sd, var)(df[u:v,1]), x1, x2)
#[[1]]
# mean sd var
#17.90000 10.15929 103.21111
#[[2]]
# mean sd var
#19.14286 12.18313 148.42857
#[[3]]
# mean sd var
#24.81818 10.78720 116.36364
Data:
x1 <- c(1,10,30)
x2 <- c(10,30,40)
set.seed(3)
df <- as.data.frame(sample(40))
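For comparison, a base-R sketch of the same result without plyr:
Map(function(u, v) {
  x <- df[u:v, 1]
  c(mean = mean(x), sd = sd(x), var = var(x))
}, x1, x2)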
Upvotes: 1
Reputation: 1234
Here is a solution to your problem:
x1 <- c(1,10,30)
x2 <- c(10,30,40)
df <- as.data.frame(sample(40))
df2 <- data.frame(x1,x2)
apply(df2, 1, function(x) mean(df[x[1]:x[2], ]))
Just replace mean() with sd() or var() to get the standard deviation or variance. Don't forget the na.rm = TRUE argument if you have missing data in df.
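For example, a sketch computing all three at once, with na.rm = TRUE applied throughout:
apply(df2, 1, function(x) {
  v <- df[x[1]:x[2], ]
  c(mean = mean(v, na.rm = TRUE),
    sd   = sd(v, na.rm = TRUE),
    var  = var(v, na.rm = TRUE))
})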
Upvotes: 1