Reputation: 35
I have a list of three data frames and would like to generate another list of three data frames whose rows consist of each of the values of a grouping variable (g1) and the means of six variables by the g1 variable. The twist is that I would like to calculate the means for the three continuous variables only when the value of the corresponding dummy variable equals 1.
Reproducible example:
a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))
u[[1]]
> u
[[1]]
   g1 dummy1 dummy2 dummy3 contin1 contin2 contin3
1  fj      1      0      0     199      18      61
2  fj      1      0      0      91     158      28
3  fj      1      1      0     147      67     190
4   a      1      0      1     181     105      22
5  fj      0      1      0      14      16     156
6   a      0      0      0     178      14      98
7   g      0      0      1     116      97      30
8   g      1      1      1      48      31     144
9   g      0      0      0      60      21     112
10  g      0      1      0      95     145     199
I would like to calculate the mean of contin1 only when dummy1 = 1, the mean of contin2 only when dummy2 = 1, and the mean of contin3 only when dummy3 = 1.
The output I want for the first list element:
> rates
[[1]]
  x[, 1]   V1  V2  V3 x[, 1] x[, 6] x[, 1] x[, 7] x[, 1] x[, 8]
1      a 0.50 0.0 0.5      a    181      a     NA      a     22
2     fj 0.75 0.5 0.0     fj 145.67     fj   41.5     fj     NA
3      g 0.25 0.5 0.5      g     48      g     88      g     87
What I have tried:
rates <- lapply(u, function(x) {
  cbind(aggregate(cbind(x[,2],x[,3],x[,4]) ~ x[,1], FUN = mean, na.action = NULL),
        aggregate(x[,6] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,2] == 1)),
        aggregate(x[,7] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,3] == 1)),
        aggregate(x[,8] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,4] == 1)))
})
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 3, 2
I understand that this error comes from cbind, which fails whenever the objects being combined have different numbers of rows. (The column x[, 6] has three rows, whereas x[, 7] and x[, 8] have two.) I was hoping there was some way for aggregate to keep one row per value of the grouping variable, so that every piece would have the same number of rows and the cbind would work. Perhaps this is not possible, per the R documentation: "Rows with missing values in any of the by variables will be omitted from the result."
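A minimal sketch of the row-count mismatch, using only the first data frame as printed above (retyped by hand, with column names instead of positional indices; the names d, m1, m2 and both are illustrative). It also shows one possible way to line the pieces up, merge with all = TRUE, which is an assumption about what would be acceptable here rather than a required approach.

```r
# First data frame from the printed output, retyped by hand (two pairs only)
d <- data.frame(
  g1      = c("fj", "fj", "fj", "a", "fj", "a", "g", "g", "g", "g"),
  dummy1  = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 0),
  dummy2  = c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1),
  contin1 = c(199, 91, 147, 181, 14, 178, 116, 48, 60, 95),
  contin2 = c(18, 158, 67, 105, 16, 14, 97, 31, 21, 145)
)

# Each aggregate() only sees the rows that pass its own subset
m1 <- aggregate(contin1 ~ g1, data = d, FUN = mean, subset = dummy1 == 1)
m2 <- aggregate(contin2 ~ g1, data = d, FUN = mean, subset = dummy2 == 1)

nrow(m1)  # 3: every group has at least one row with dummy1 == 1
nrow(m2)  # 2: group "a" has no row with dummy2 == 1, so it is dropped

# A full outer join keeps one row per group and fills the gap with NA
both <- merge(m1, m2, by = "g1", all = TRUE)
both
```

The groups disappear because the subset is applied before grouping, so a group with no qualifying rows never reaches aggregate at all.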
I have carefully read the documentation for aggregate. The following two posts address similar issues, but neither uses different subsets of the data to calculate the means:
"R: Calculate means for subset of a group" and "Means from a list of data frames in R"
Any suggestions would be hugely appreciated.
Upvotes: 1
Views: 330
Reputation: 887531
Another option would be to change the format from 'wide' to 'long' and convert back to 'wide' after getting the 'mean' values. For multiple value columns, this is now possible with melt and dcast from the devel version of data.table, i.e. v1.9.5. It can be installed from here. (Used the same dataset from @akhmed's post).
We can melt the datasets within the list ('u') by specifying the index of columns ('dummy' and 'contin') in measure.vars as a list. Get the mean of the 'dummy' and 'contin' columns grouped by 'g1' and 'variable' (created by the 'melt'), then dcast from long to wide by specifying value.var as 'dummyMean' and 'continMean'.
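library(data.table)  # melt() on several column groups needs data.table >= 1.9.5

res <- lapply(u, function(x) {
  # wide -> long: stack the dummy columns (2:4) and the contin columns (5:7) in parallel
  x1 <- melt(setDT(x), measure.vars = list(2:4, 5:7),
             value.name = c('dummy', 'contin'))
  # per g1 and variable: mean of dummy, and mean of contin where dummy == 1
  x2 <- x1[, list(dummyMean = mean(dummy, na.rm = TRUE),
                  continMean = mean(contin[dummy == 1], na.rm = TRUE)),
           by = list(g1, variable)]
  # long -> wide: one column per variable and statistic
  dcast(x2, g1 ~ variable, value.var = c('dummyMean', 'continMean'))
})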
res[[1]]
# g1 1_dummyMean 2_dummyMean 3_dummyMean 1_continMean 2_continMean
#1: a 0.50 0.0 0.5 128.00000 NaN
#2: fj 0.75 0.5 0.0 94.66667 64
#3: g 0.25 0.5 0.5 54.00000 57
# 3_continMean
#1: 17
#2: NaN
#3: 146
Or a base R option using Map. I created the functions 'fdummy' and 'fcontin' to select the 'dummy' and 'contin' columns. Loop through 'u' with lapply(...), use Map to pair up the corresponding 'dummy' and 'contin' columns, and within each pair use tapply grouped by the 'g1' column to get the mean of 'dummy' and the mean of 'contin' where 'dummy == 1'; then cbind the results.
# select the 'dummy' and 'contin' columns by name
fdummy <- function(x) x[grep('dummy', names(x))]
fcontin <- function(x) x[grep('contin', names(x))]
res2 <- lapply(u, function(x) {
  do.call(cbind.data.frame,
          Map(function(x, y, z) {
                z <- factor(z)  # keep every group as a level so empty groups give NA
                cbind(tapply(x, z, FUN = mean),
                      tapply(y[x == 1], z[x == 1], FUN = mean))
              },
              fdummy(x), fcontin(x), x['g1']))
})
lapply(res2, setNames, c(rbind(paste0('dummyMean', 1:3),
                               paste0('continMean', 1:3))))[[1]]
# dummyMean1 continMean1 dummyMean2 continMean2 dummyMean3 continMean3
#a 0.50 128.00000 0.0 NA 0.5 17
#fj 0.75 94.66667 0.5 64 0.0 NA
#g 0.25 54.00000 0.5 57 0.5 146
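As a cross-check against the desired output in the question, here is a minimal base R sketch of the same tapply idea applied to the first data frame exactly as the question prints it (retyped by hand, since the question sets no seed). The names d, g and cond are illustrative, and the grouping column is converted to a factor explicitly so that groups with no qualifying rows still appear, with NA.

```r
# First data frame as printed in the question, retyped by hand
d <- data.frame(
  g1      = c("fj", "fj", "fj", "a", "fj", "a", "g", "g", "g", "g"),
  dummy1  = c(1, 1, 1, 1, 0, 0, 0, 1, 0, 0),
  dummy2  = c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1),
  dummy3  = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 0),
  contin1 = c(199, 91, 147, 181, 14, 178, 116, 48, 60, 95),
  contin2 = c(18, 158, 67, 105, 16, 14, 97, 31, 21, 145),
  contin3 = c(61, 28, 190, 22, 156, 98, 30, 144, 112, 199)
)

g <- factor(d$g1)                       # levels: a, fj, g
dummies <- d[grep("dummy", names(d))]
contins <- d[grep("contin", names(d))]

# Mean of each contin column over the rows where its paired dummy equals 1.
# Subsetting a factor keeps all of its levels, so empty groups come back as NA.
cond <- mapply(function(v, f) tapply(v[f == 1], g[f == 1], mean),
               contins, dummies)
round(cond, 2)
# Rows are a, fj, g; columns are contin1, contin2, contin3:
#   a:  181.00     NA  22
#   fj: 145.67   41.5  NA
#   g:   48.00   88.0  87
```

These values agree with the means shown in the question's desired output, which suggests the conditional-mean logic is the intended one.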
Upvotes: 1
Reputation: 3635
If you have dplyr installed, the following code seems to solve your problem.
library(dplyr)
set.seed(1234)
a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))
rates <- lapply(u, function(x)
  x %>%
    mutate( contin1_ = ifelse(dummy1==1, contin1, NA) ) %>%
    mutate( contin2_ = ifelse(dummy2==1, contin2, NA) ) %>%
    mutate( contin3_ = ifelse(dummy3==1, contin3, NA) ) %>%
    group_by(g1) %>%
    summarize(
      V1 = mean(dummy1, na.rm=TRUE),
      V2 = mean(dummy2, na.rm=TRUE),
      V3 = mean(dummy3, na.rm=TRUE),
      mean1 = mean(contin1_, na.rm=TRUE),
      mean2 = mean(contin2_, na.rm=TRUE),
      mean3 = mean(contin3_, na.rm=TRUE)
    )
)
print(rates[[1]])
Which gives me this:
Source: local data frame [3 x 7]
g1 V1 V2 V3 mean1 mean2 mean3
1 a 0.50 0.0 0.5 128.00000 NaN 17
2 fj 0.75 0.5 0.0 94.66667 64 NaN
3 g 0.25 0.5 0.5 54.00000 57 146
The numbers I get look plausible, and the missing values (NaN) are in all the right places. Unfortunately, your example is not fully reproducible because you did not set a seed before generating the random numbers, so my runif gives different values than yours.
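The masking idea does not depend on dplyr. Below is a rough base R sketch, assuming the first data frame as printed in the question (retyped by hand, one dummy/contin pair only; d, contin2_ and res are illustrative names). Values are blanked with NA where the dummy is 0 instead of dropping rows, so aggregate still returns one row per group when it is told to keep NA rows with na.action = na.pass.

```r
# One dummy/contin pair from the question's printed data, retyped by hand
d <- data.frame(
  g1      = c("fj", "fj", "fj", "a", "fj", "a", "g", "g", "g", "g"),
  dummy2  = c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1),
  contin2 = c(18, 158, 67, 105, 16, 14, 97, 31, 21, 145)
)

# Mask instead of subsetting: every row stays, unwanted values become NA
d$contin2_ <- ifelse(d$dummy2 == 1, d$contin2, NA)

# na.pass keeps the NA rows; the NAs are dropped inside mean() instead
res <- aggregate(contin2_ ~ g1, data = d,
                 FUN = function(v) mean(v, na.rm = TRUE),
                 na.action = na.pass)
res
# Group "a" has no row with dummy2 == 1, so its mean is taken over an
# empty vector and comes back as NaN; fj gives 41.5 and g gives 88.
```

This is also why the output above shows NaN where the question's desired output shows NA: the mean of an empty numeric vector is NaN.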
Upvotes: 1