DavidC

Reputation: 377

Fast aggregation using dplyr - better way?

I have a large data frame with a grouping variable and many other variable columns. I want to calculate the mean of each variable by group, but taking the proportion of missing data into account: if more than 75% of the values are present, calculate the mean; otherwise return NA.

My actual data has many more columns than the test data below. This approach seems to be pretty quick, but is there a quicker way?

library(dplyr)

# number of groups
n <- 100000
dat <- data.frame(grp  = factor(rep(1:n, each = 10)),
                  var1 = rep(c(1:8, NA, NA), times = n),
                  var2 = rep(c(1:7, NA, NA, NA), times = n))

# summarise by group, calculate mean if enough data
res <- dat %>%
  group_by(grp) %>%
  summarise_each(funs(ifelse(length(na.omit(.)) / length(.) > 0.75,
                             mean(., na.rm = TRUE), NA)))
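(Note for readers on current dplyr: summarise_each() and funs() have since been deprecated. The same rule can be written with across(); a minimal sketch, assuming dplyr >= 1.0 and the dat created above:)

library(dplyr)

# keep the mean only when more than 75% of the values are present
res_across <- dat %>%
  group_by(grp) %>%
  summarise(across(everything(),
                   ~ if (mean(!is.na(.x)) > 0.75) mean(.x, na.rm = TRUE) else NA))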

Thanks

David

Upvotes: 0

Views: 489

Answers (1)

BrodieG

Reputation: 52677

Here is an option that is almost 5x faster:

system.time(
  res0 <- dat %>% 
    group_by(grp) %>% 
    summarise_each(
      funs(
        ifelse(
          length(na.omit(.)) / length(.) > 0.75, 
          mean(., na.rm = TRUE), NA)
      ) )
)
#   user  system elapsed 
#   7.27    0.00    7.29 
system.time(
  res1 <- dat %>% 
    group_by(grp) %>% 
    summarise_each(
      funs(
        if(sum(is.na(.)) / length(.) < 0.25) mean(., na.rm=TRUE) 
        else NA
    ) )
)
#   user  system elapsed 
#   1.59    0.00    1.60 
all.equal(res0, res1)
#  [1] TRUE
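The speedup comes from two changes: the vectorised ifelse() is replaced by a plain if/else, and na.omit(.), which copies the vector and records the dropped positions in an attribute, is replaced by sum(is.na(.)), which only scans it. A quick way to see the difference on a single vector, assuming the microbenchmark package is available:

library(microbenchmark)

x <- c(1:8, NA, NA)
microbenchmark(
  na_omit = length(na.omit(x)) / length(x),
  is_na   = sum(is.na(x)) / length(x)
)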

And an extra 2x speed increase with data.table:

library(data.table)

system.time(
  res2 <- setDT(dat)[, 
    lapply(
      .SD, 
      function(x) 
        if(sum(is.na(x)) / .N < 0.25) mean(x, na.rm=TRUE) else NA
    ), 
  by=grp]
)
#   user  system elapsed 
#   0.76    0.00    0.76 
all.equal(res0, setDF(res2))
#  [1] TRUE
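One caveat: setDT(dat) converts dat to a data.table by reference, so after the timing above dat is no longer a plain data.frame. If you need the original intact, a copy avoids that at the cost of the conversion; a minimal sketch:

library(data.table)

# as.data.table() copies, leaving the original data.frame untouched
DT <- as.data.table(dat)
res3 <- DT[, lapply(.SD, function(x)
             if (sum(is.na(x)) / .N < 0.25) mean(x, na.rm = TRUE) else NA),
           by = grp]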

Upvotes: 3
