R -- How can I calculate group means for a list of data frames, using a different subset condition to calculate each mean?

Question

I have a list of three data frames and would like to generate another list of three data frames, each with one row per value of a grouping variable (g1) and the means of six variables within that group. The twist is that the means of the three continuous variables should be calculated only from rows where the corresponding dummy variable equals 1.

Reproducible example:

a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))

u[[1]]

> u
[[1]]
   g1 dummy1 dummy2 dummy3 contin1 contin2 contin3
1  fj      1      0      0     199      18      61
2  fj      1      0      0      91     158      28
3  fj      1      1      0     147      67     190
4   a      1      0      1     181     105      22
5  fj      0      1      0      14      16     156
6   a      0      0      0     178      14      98
7   g      0      0      1     116      97      30
8   g      1      1      1      48      31     144
9   g      0      0      0      60      21     112
10  g      0      1      0      95     145     199

I would like to calculate the mean of contin1 only when dummy1 = 1, the mean of contin2 only when dummy2 = 1, and the mean of contin3 only when dummy3 = 1.

The output I WANT for the first list:

> rates
[[1]]
  x[, 1]   V1  V2  V3 x[, 1] x[, 6] x[, 1] x[, 7] x[, 1] x[, 8]
1      a 0.50 0.0 0.5      a 181.00      a     NA      a     22
2     fj 0.75 0.5 0.0     fj 145.67     fj   41.5     fj     NA
3      g 0.25 0.5 0.5      g  48.00      g   88.0      g     87

What I have tried:

rates <- lapply(u, function(x) {
    cbind(aggregate(cbind(x[,2],x[,3],x[,4]) ~ x[,1], FUN = mean, na.action = NULL),
    aggregate(x[,6] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,2] == 1)),
    aggregate(x[,7] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,3] == 1)),
    aggregate(x[,8] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,4] == 1)))
    })
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 3, 2

I understand that this error comes from cbind, which fails whenever the objects being bound have different numbers of rows. (The x[, 6] result has three rows, whereas the x[, 7] and x[, 8] results have two.) I was hoping there was some way to make aggregate keep one row per value of the grouping variable, so that every result would have the same number of rows and the cbind would work. Perhaps this is not possible, given this line in the R documentation: "Rows with missing values in any of the by variables will be omitted from the result."
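The row-count mismatch described above can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical data frame `d` typed in by hand from the `u[[1]]` values printed earlier (only the columns needed here). It also shows one base R way to keep every group, `merge(..., all = TRUE)`, which is offered as an illustration rather than as the approach any answer takes.

```r
# Hand-typed copy of part of u[[1]] as printed in the question (assumption:
# these are the values of interest; the original was generated without a seed)
d <- data.frame(
  g1      = c("fj","fj","fj","a","fj","a","g","g","g","g"),
  dummy1  = c(1,1,1,1,0,0,0,1,0,0),
  dummy2  = c(0,0,1,0,1,0,0,1,0,1),
  contin1 = c(199,91,147,181,14,178,116,48,60,95),
  contin2 = c(18,158,67,105,16,14,97,31,21,145)
)

# Each call only sees the rows that pass its own subset condition
m1 <- aggregate(contin1 ~ g1, data = d, FUN = mean, subset = dummy1 == 1)
m2 <- aggregate(contin2 ~ g1, data = d, FUN = mean, subset = dummy2 == 1)

nrow(m1)  # 3: every group has at least one row with dummy1 == 1
nrow(m2)  # 2: group "a" has no row with dummy2 == 1, so it is dropped

# cbind(m1, m2) fails because 3 != 2; a full outer join on g1 keeps all groups
merged <- merge(m1, m2, by = "g1", all = TRUE)
merged
```

In `merged`, group "a" gets `NA` for `contin2`, which is the shape of the output the question asks for.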

I have carefully read the documentation for aggregate. The following two posts address similar issues, but neither uses different subsets of the data to calculate each mean:

"R: Calculate means for subset of a group" and "Means from a list of data frames in R"

Any suggestions would be hugely appreciated.

akhmed · Accepted Answer

If you have dplyr installed, the following code seems to solve your problem.

library(dplyr)

set.seed(1234)

a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))


rates <- lapply(u, function(x)
  x %>%
    # Keep each continuous value only where its dummy equals 1; mask the rest as NA
    mutate(
      contin1_ = ifelse(dummy1 == 1, contin1, NA),
      contin2_ = ifelse(dummy2 == 1, contin2, NA),
      contin3_ = ifelse(dummy3 == 1, contin3, NA)
    ) %>%
    group_by(g1) %>%
    summarize(
      # Share of rows in each group where the dummy equals 1
      V1 = mean(dummy1, na.rm = TRUE),
      V2 = mean(dummy2, na.rm = TRUE),
      V3 = mean(dummy3, na.rm = TRUE),
      # Means over the unmasked rows only (NaN when a group has none)
      mean1 = mean(contin1_, na.rm = TRUE),
      mean2 = mean(contin2_, na.rm = TRUE),
      mean3 = mean(contin3_, na.rm = TRUE)
    )
)

print(rates[[1]])

Which gives me this:

Source: local data frame [3 x 7]

  g1   V1  V2  V3     mean1 mean2 mean3
1  a 0.50 0.0 0.5 128.00000   NaN    17
2 fj 0.75 0.5 0.0  94.66667    64   NaN
3  g 0.25 0.5 0.5  54.00000    57   146

The numbers that I get seem to be approximately correct, and the missing values are in all the right places. They show up as NaN rather than NA because mean() with na.rm=TRUE returns NaN when every value in a group has been masked. Unfortunately, your example is not fully reproducible: you did not set a seed before generating the random variables, so my runif calls give different values than yours.
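If dplyr is not available, the same masking idea can be written in base R. This is a sketch rather than part of the original answer: `d` is a hand-typed copy of the `u[[1]]` values printed in the question, `cond_means` is a hypothetical helper name, and the column names `dummy1..3` and `contin1..3` are assumed to match the question's data.

```r
# Hand-typed copy of u[[1]] as printed in the question (assumption)
d <- data.frame(
  g1      = c("fj","fj","fj","a","fj","a","g","g","g","g"),
  dummy1  = c(1,1,1,1,0,0,0,1,0,0),
  dummy2  = c(0,0,1,0,1,0,0,1,0,1),
  dummy3  = c(0,0,0,1,0,0,1,1,0,0),
  contin1 = c(199,91,147,181,14,178,116,48,60,95),
  contin2 = c(18,158,67,105,16,14,97,31,21,145),
  contin3 = c(61,28,190,22,156,98,30,144,112,199)
)

# Hypothetical helper: group means where each continuous variable only
# counts the rows whose matching dummy equals 1
cond_means <- function(x) {
  # Mask continuous values whose dummy is not 1
  for (i in 1:3) {
    keep <- x[[paste0("dummy", i)]] == 1
    x[[paste0("mean", i)]] <- ifelse(keep, x[[paste0("contin", i)]], NA_real_)
  }
  # na.pass keeps rows that contain NA; na.rm = TRUE is forwarded to mean()
  res <- aggregate(
    cbind(dummy1, dummy2, dummy3, mean1, mean2, mean3) ~ g1,
    data = x, FUN = mean, na.rm = TRUE, na.action = na.pass
  )
  # mean() of an all-NA group gives NaN; report it as NA instead
  res[] <- lapply(res, function(col) {
    if (is.numeric(col)) replace(col, is.nan(col), NA) else col
  })
  res
}

rates1 <- cond_means(d)
rates1

# For a list of data frames, apply the helper to each element
rates <- lapply(list(d), cond_means)
```

Because all six means come from a single `aggregate` call, every group keeps its row and no `cbind` of differently sized results is needed.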
