Assume the following dummy data frame:
library(data.table)
dt <- data.table(A = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
                 B = c("e", "e", "e", "e", "e", "e", "f", "f", "f", "f", "f", "f"),
                 C = 1:12,
                 D = 13:24)
I'd like to calculate some statistics (say, mean and standard deviation) for each numeric column ("C" and "D"), grouped in turn by the factor columns c("A"), c("B"), and c("A", "B"). In the actual data frame, I have about 40 numeric columns, 10 factor columns that group in different combinations, and a long list of statistics I'd like to calculate. Based on the answer (by @thelatemail) to a previous question of mine, I know I can use the code below to handle the factor groupings (by=) using a list:
groupList <- list(c("A", "B"), c("A"), c("B"))
out <- vector("list", 3)
out <- lapply(
  groupList,
  function(x) {
    dt[, .(mean=mean(C), sd=sd(C)), by=x]
  }
)
Now I'd like to go a step further: create a variable containing a list of the names of the numeric columns in the data frame and use that variable within the function above. I came up with the following code, but unfortunately it doesn't work. My idea is to use a loop to extract a value from measureList at each iteration and pass that value to the mean and sd functions. Any ideas? The loop is how I tend to think of these things, but I'll be glad to get rid of it if that makes the code faster or more efficient (particularly because one of my factor columns has 90 levels). I'd appreciate any pointers for solving this problem. Thanks!
factorList <- list(c("A"), c("B"), c("A", "B"))
measureList <- list(c("C"), c("D"))
out <- vector("list", 2)
for(i in 1:length(measureList)){
  out[[i]] <- lapply(
    factorList,
    function(x) {
      dt[, .(mean=mean(eval(measureList[[i]])),
             sd=sd(eval(measureList[[i]]))),
         by = x]
    }
  )
}
Upvotes: 1
Reputation: 79338
You can use outer with a vectorized function, or use Map, as shown below:
m = function(x,y)dt[, .(mean=mean(get(y)), sd=sd(get(y))), by=x]
c(outer(factorList,measureList,Vectorize(m)))
or
Map(m,rep(factorList,each=length(measureList)),measureList)
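For context on why get(y) is used here: the column names are stored as character strings, and eval() on a string simply returns the string, so mean() would receive "C" rather than the column itself. A minimal sketch of the difference, rebuilding the question's dt (the variable names col, evaluated and by_B are only illustrative):

```r
library(data.table)

# Same dummy data as in the question
dt <- data.table(A = rep(c("a", "b", "c", "d"), each = 3),
                 B = rep(c("e", "f"), each = 6),
                 C = 1:12,
                 D = 13:24)

col <- "C"  # illustrative: a column name held as a string

# eval() of a character string returns the string itself, not the column
evaluated <- eval(col)

# get() looks the name up within the data.table and returns the column
by_B <- dt[, .(mean = mean(get(col)), sd = sd(get(col))), by = "B"]
print(by_B)
```

Calling mean() on the string itself returns NA with a warning, which would explain the result of the loop in the question.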
EDIT:
To include the measure name in the column names:
# grouping columns keep their own names; the statistics get a _<measure> suffix
m = function(x,y)setNames(dt[, .(mean(get(y)),sd(get(y))), by=x],
c(x,paste(c("mean","sd"),y,sep="_")))
c(outer(factorList,measureList,Vectorize(m)))
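With around 40 numeric columns, it may be simpler to compute all measures in one pass per grouping instead of one table per (grouping, measure) pair. The following is a sketch of that idea using .SD and .SDcols, not part of the answer above; the names measures and stats are hypothetical helpers introduced for illustration:

```r
library(data.table)

# Same dummy data as in the question
dt <- data.table(A = rep(c("a", "b", "c", "d"), each = 3),
                 B = rep(c("e", "f"), each = 6),
                 C = 1:12,
                 D = 13:24)

factorList <- list("A", "B", c("A", "B"))
measures <- c("C", "D")  # hypothetical: names of the numeric columns

# hypothetical helper: the statistics to compute for one column
stats <- function(v) list(mean = mean(v), sd = sd(v))

# One result table per grouping; every measure gets a <column>.<stat> column
out <- lapply(factorList, function(g) {
  dt[, unlist(lapply(.SD, stats), recursive = FALSE),
     by = g, .SDcols = measures]
})

print(out[[2]])  # grouped by B: columns B, C.mean, C.sd, D.mean, D.sd
```

Adding another statistic then only requires extending stats, and adding another column only requires extending measures.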
Upvotes: 1
Reputation: 83275
Another possibility is to use the new groupingsets function from data.table:
groupingsets(dt
, j = lapply(.SD, function(x) list(mean(x), sd(x)))
, by = c('A','B')
, sets = factorList)[, type := c('mean','sd')][]
which gives:
       A    B        C        D type
 1:    a <NA>        2       14 mean
 2:    a <NA>        1        1   sd
 3:    b <NA>        5       17 mean
 4:    b <NA>        1        1   sd
 5:    c <NA>        8       20 mean
 6:    c <NA>        1        1   sd
 7:    d <NA>       11       23 mean
 8:    d <NA>        1        1   sd
 9: <NA>    e      3.5     15.5 mean
10: <NA>    e 1.870829 1.870829   sd
11: <NA>    f      9.5     21.5 mean
12: <NA>    f 1.870829 1.870829   sd
13:    a    e        2       14 mean
14:    a    e        1        1   sd
15:    b    e        5       17 mean
16:    b    e        1        1   sd
17:    c    f        8       20 mean
18:    c    f        1        1   sd
19:    d    f       11       23 mean
20:    d    f        1        1   sd
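If one row per group is preferred over alternating mean/sd rows, a variant with named scalar expressions in j may be easier to work with. This is a sketch under the assumption that the installed data.table provides groupingsets with the id argument; the output column names mean_C and sd_C are chosen here for illustration:

```r
library(data.table)

# Same dummy data as in the question
dt <- data.table(A = rep(c("a", "b", "c", "d"), each = 3),
                 B = rep(c("e", "f"), each = 6),
                 C = 1:12,
                 D = 13:24)

# One row per group; id = TRUE adds a "grouping" column that identifies
# which grouping set each row belongs to
res <- groupingsets(dt,
                    j = .(mean_C = mean(C), sd_C = sd(C)),
                    by = c("A", "B"),
                    sets = list("A", "B", c("A", "B")),
                    id = TRUE)
print(res)
```

Columns that are not part of a given grouping set are filled with NA, as in the output above.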
Upvotes: 2
Reputation: 5415
This uses dplyr and purrr, but I think it works.
library(dplyr)
library(purrr)
combos <- expand.grid(factorList, measureList)
# funs() is deprecated since dplyr 0.8.0; a named list of functions replaces it
map2(combos[, 1],
     combos[, 2],
     ~ dt %>% group_by_at(.x) %>% summarize_at(.y, list(mean = mean, sd = sd)))
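With dplyr 1.0 or later, the same idea can be written with across(), which handles all measure columns in a single summarise() call per grouping. A minimal sketch, assuming dplyr >= 1.0 is installed; a plain data.frame named df stands in for dt, and measures is a hypothetical name:

```r
library(dplyr)
library(purrr)

# Same dummy data as in the question, as a plain data.frame
df <- data.frame(A = rep(c("a", "b", "c", "d"), each = 3),
                 B = rep(c("e", "f"), each = 6),
                 C = 1:12,
                 D = 13:24)

factorList <- list("A", "B", c("A", "B"))
measures <- c("C", "D")  # hypothetical: names of the numeric columns

# One summary table per grouping; across() names results <column>_<function>
out <- map(factorList, function(g) {
  df %>%
    group_by(across(all_of(g))) %>%
    summarise(across(all_of(measures), list(mean = mean, sd = sd)),
              .groups = "drop")
})

print(out[[2]])  # grouped by B: columns B, C_mean, C_sd, D_mean, D_sd
```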
Upvotes: 1