r: group, remove columns, and sum

Question

I've had some trouble with a large data.frame. I need to sum each column of groups, if each group column does not have any 0's (complete). I.E. I only want to sum columns of each group that is "complete".

Here is an example of needing to group and sum each column, however, I cannot figure out how to work complete.cases in a dplyr pipeline

df <- data.frame(ca = c("a","b","a","c","b"),
             f = c(3,4,0,2,3),
             f2 = c(2,5,6,1,9),
             f3 = c(3,0,6,3,0))

What the outcome should look like

  ca  f f2 f3
1  a NA  8  9
2  b  7 14 NA
3  c  2  1  3

This works to sum each group

df2 <- df %>%
    arrange(ca) %>%
    group_by(ca) %>%
    summarize_at(.cols=vars(starts_with("f")),
            .funs=funs("sum"))

Here is what I cannot get to work, but it seems like what I should be working towards

df2 <- df %>%
    arrange(ca) %>%
    group_by(ca) %>%
    summarize_(funs_(sum(complete.cases(.),na.rm=T)))

Maybe I need a summarize_if, any help would be greatly appreciated.

alistaire · Accepted Answer

If one column is grouped, the *_all functions will operate on all the non-grouping columns. You can use na_if to insert NAs for a particular value, which makes the whole process fairly simple:

df %>% mutate_all(funs(na_if(., 0L))) %>% 
    group_by(ca) %>%
    summarise_all(sum)

## # A tibble: 3 × 4
##       ca     f    f2    f3
##      
## 1      a    NA     8     9
## 2      b     7    14    NA
## 3      c     2     1     3

or combine the two calls, if you like:

df %>% group_by(ca) %>% summarise_all(funs(sum(na_if(., 0L))))

which returns the same thing.

Benchmarks

Per the comments, benchmarks on 10000 rows and 100 non-grouping columns. Very wide data (more than 1000 columns) does not fare well with either method, but if you gather to long and group by the former variable names, it's tolerable.

library(tidyr)
set.seed(47)

df <- data.frame(ca = sample(letters[1:3], 10000, replace = TRUE), 
                 replicate(100, rpois(100, 10)))

microbenchmark::microbenchmark(
    'two stp' = {
        df %>% mutate_all(funs(na_if(., 0L))) %>% 
            group_by(ca) %>% summarise_all(sum)
    }, 'one stp' = {
        df %>% group_by(ca) %>% summarise_all(funs(sum(na_if(., 0L))))
    }, 'two stp, reshape' = {
        df %>% gather(var, val, -ca) %>% 
            mutate(val = na_if(val, 0L)) %>% 
            group_by(ca, var) %>% summarise(val = sum(val)) %>% 
            spread(var, val)
    }, 'one stp, reshape' = {
        df %>% gather(var, val, -ca) %>% 
            group_by(ca, var) %>% summarise(val = sum(na_if(val, 0L))) %>% 
            spread(var, val)
    })

## Unit: milliseconds
##              expr       min        lq      mean    median        uq      max neval cld
##           two stp 311.36733 330.23884 347.77353 340.98458 354.21105 548.4810   100   c
##           one stp 299.90327 317.38300 329.78662 326.66370 341.09945 385.1589   100  b 
##  two stp, reshape  61.72992  67.78778  85.94939  73.37648  81.04525 300.5608   100 a  
##  one stp, reshape  70.95492  77.76685  90.53199  83.33557  90.14023 297.8924   100 a

Using data.tables via dtplyr is much faster. If you don't mind learning another grammar, writing in data.table is faster yet (h/t @docendodiscimus for replace). Reshaping results in worse times here, at least with tidyr functions, though with data.table::melt and dcast it still may be a good option for extremely wide data.

library(data.table)
library(dtplyr)
set.seed(47)

df <- data.frame(ca = sample(letters[1:3], 10000, replace = TRUE), 
                 replicate(100, rpois(10000, 10)))
setDT(df)

microbenchmark::microbenchmark(
    'dtplyr 2 stp' = {
        df %>% mutate_all(funs(na_if(., 0L))) %>% 
            group_by(ca) %>% 
            summarise_all(sum)
    }, 'dtplyr 1 stp' = {
        df %>% group_by(ca) %>% 
            summarise_all(funs(sum(na_if(., 0L))))
    }, 'dt + na_if 2 stp' = {
        df[, lapply(.SD, function(x){na_if(x, 0L)})][, lapply(.SD, sum), by = ca]
    }, 'dt + na_if 1 stp' = {
        df[, lapply(.SD, function(x){sum(na_if(x, 0L))}), by = ca]
    }, 'pure dt 2 stp' = {
        df[, lapply(.SD, function(x){replace(x, x == 0L, NA)})][, lapply(.SD, sum), by = ca]
    }, 'pure dt 1 stp' = {
        df[, lapply(.SD, function(x){sum(replace(x, x == 0L, NA))}), by = ca]
    })

## Unit: milliseconds
##              expr       min        lq      mean    median        uq       max neval cld
##      dtplyr 2 stp 121.31556 130.88189 143.39661 138.32966 146.39086 355.24750   100   c
##      dtplyr 1 stp  28.30813  31.03421  36.94506  33.28435  43.46300  55.36789   100  b 
##  dt + na_if 2 stp  27.03971  29.04306  34.06559  31.20259  36.95895  53.66865   100  b 
##  dt + na_if 1 stp  10.50404  12.64638  16.10507  13.43007  15.18257  34.37919   100 a  
##     pure dt 2 stp  27.15501  28.91975  35.07725  30.28981  33.03950 238.66445   100  b 
##     pure dt 1 stp  10.49617  12.09324  16.31069  12.84595  20.03662  34.44306   100 a

r: group, remove columns, and sum

Answers (2)

Benchmarks

Related Questions