Reputation: 428
I've had some trouble with a large data.frame. I need to sum each column of groups, if each group column does not have any 0's (complete). I.E. I only want to sum columns of each group that is "complete".
Here is an example of needing to group and sum each column, however, I cannot figure out how to work complete.cases
in a dplyr pipeline
df <- data.frame(ca = c("a","b","a","c","b"),
f = c(3,4,0,2,3),
f2 = c(2,5,6,1,9),
f3 = c(3,0,6,3,0))
What the outcome should look like
ca f f2 f3
1 a NA 8 9
2 b 7 14 NA
3 c 2 1 3
This works to sum each group
df2 <- df %>%
arrange(ca) %>%
group_by(ca) %>%
summarize_at(.cols=vars(starts_with("f")),
.funs=funs("sum"))
Here is what I cannot get to work, but it seems like what I should be working towards
df2 <- df %>%
arrange(ca) %>%
group_by(ca) %>%
summarize_(funs_(sum(complete.cases(.),na.rm=T)))
Maybe I need a summarize_if
, any help would be greatly appreciated.
Upvotes: 2
Views: 3084
Reputation: 43334
If one column is grouped, the *_all
functions will operate on all the non-grouping columns. You can use na_if
to insert NA
s for a particular value, which makes the whole process fairly simple:
df %>% mutate_all(funs(na_if(., 0L))) %>%
group_by(ca) %>%
summarise_all(sum)
## # A tibble: 3 × 4
## ca f f2 f3
## <fctr> <dbl> <dbl> <dbl>
## 1 a NA 8 9
## 2 b 7 14 NA
## 3 c 2 1 3
or combine the two calls, if you like:
df %>% group_by(ca) %>% summarise_all(funs(sum(na_if(., 0L))))
which returns the same thing.
Per the comments, benchmarks on 10000 rows and 100 non-grouping columns. Very wide data (more than 1000 columns) does not fare well with either method, but if you gather to long and group by the former variable names, it's tolerable.
library(tidyr)
set.seed(47)
df <- data.frame(ca = sample(letters[1:3], 10000, replace = TRUE),
replicate(100, rpois(100, 10)))
microbenchmark::microbenchmark(
'two stp' = {
df %>% mutate_all(funs(na_if(., 0L))) %>%
group_by(ca) %>% summarise_all(sum)
}, 'one stp' = {
df %>% group_by(ca) %>% summarise_all(funs(sum(na_if(., 0L))))
}, 'two stp, reshape' = {
df %>% gather(var, val, -ca) %>%
mutate(val = na_if(val, 0L)) %>%
group_by(ca, var) %>% summarise(val = sum(val)) %>%
spread(var, val)
}, 'one stp, reshape' = {
df %>% gather(var, val, -ca) %>%
group_by(ca, var) %>% summarise(val = sum(na_if(val, 0L))) %>%
spread(var, val)
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## two stp 311.36733 330.23884 347.77353 340.98458 354.21105 548.4810 100 c
## one stp 299.90327 317.38300 329.78662 326.66370 341.09945 385.1589 100 b
## two stp, reshape 61.72992 67.78778 85.94939 73.37648 81.04525 300.5608 100 a
## one stp, reshape 70.95492 77.76685 90.53199 83.33557 90.14023 297.8924 100 a
Using data.table
s via dtplyr
is much faster. If you don't mind learning another grammar, writing in data.table
is faster yet (h/t @docendodiscimus for replace
). Reshaping results in worse times here, at least with tidyr
functions, though with data.table::melt
and dcast
it still may be a good option for extremely wide data.
library(data.table)
library(dtplyr)
set.seed(47)
df <- data.frame(ca = sample(letters[1:3], 10000, replace = TRUE),
replicate(100, rpois(10000, 10)))
setDT(df)
microbenchmark::microbenchmark(
'dtplyr 2 stp' = {
df %>% mutate_all(funs(na_if(., 0L))) %>%
group_by(ca) %>%
summarise_all(sum)
}, 'dtplyr 1 stp' = {
df %>% group_by(ca) %>%
summarise_all(funs(sum(na_if(., 0L))))
}, 'dt + na_if 2 stp' = {
df[, lapply(.SD, function(x){na_if(x, 0L)})][, lapply(.SD, sum), by = ca]
}, 'dt + na_if 1 stp' = {
df[, lapply(.SD, function(x){sum(na_if(x, 0L))}), by = ca]
}, 'pure dt 2 stp' = {
df[, lapply(.SD, function(x){replace(x, x == 0L, NA)})][, lapply(.SD, sum), by = ca]
}, 'pure dt 1 stp' = {
df[, lapply(.SD, function(x){sum(replace(x, x == 0L, NA))}), by = ca]
})
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## dtplyr 2 stp 121.31556 130.88189 143.39661 138.32966 146.39086 355.24750 100 c
## dtplyr 1 stp 28.30813 31.03421 36.94506 33.28435 43.46300 55.36789 100 b
## dt + na_if 2 stp 27.03971 29.04306 34.06559 31.20259 36.95895 53.66865 100 b
## dt + na_if 1 stp 10.50404 12.64638 16.10507 13.43007 15.18257 34.37919 100 a
## pure dt 2 stp 27.15501 28.91975 35.07725 30.28981 33.03950 238.66445 100 b
## pure dt 1 stp 10.49617 12.09324 16.31069 12.84595 20.03662 34.44306 100 a
Upvotes: 3
Reputation: 38500
One way to go in base R is to fill the 0s in as NA and then use aggregate.
# fill 0s as NAs
is.na(df) <- df == 0
aggregate(cbind(f=df$f,f2=df$f2,f3=df$f3), df["ca"], sum)
ca f f2 f3
1 a NA 8 9
2 b 7 14 NA
3 c 2 1 3
Note: Using the formula interface to aggregate
may produce an unexpected result.
aggregate(.~ca, data=df, sum)
ca f f2 f3
1 a 3 2 3
2 c 2 1 3
The "b" category drops out and the value for a in variable f is 3, not NA. The specification in the help file indicates that na.action is set to na.omit
, which drops NA values from computation. To get the formula interface to work as desired, change this value to na.pass.
aggregate(.~ca, data=df, sum, na.action=na.pass)
ca f f2 f3
1 a NA 8 9
2 b 7 14 NA
3 c 2 1 3
Upvotes: 1