Ignacio
Ignacio

Reputation: 7928

group by and then count missing variables?

My data looks something like this:

df1 <- data.frame(
  Z = sample(LETTERS[1:5], size = 10000, replace = T),
  X1 = sample(c(1:10,NA), 10000, replace = T),
  X2 = sample(c(1:25,NA), 10000, replace = T),
  X3 = sample(c(1:5,NA), 10000, replace = T)
)

I can count the missing variables with:

data.frame("Total Missing" = colSums(is.na(df1))) 

But, I would like to this by Z. That is, the number of missing X1-3 for each value of Z.

I tried this

df1 %>% group_by(Z) %>% summarise('Total Missing' = colSums(is.na(df1)))

but it does not work as I expected.

Upvotes: 3

Views: 614

Answers (1)

talat
talat

Reputation: 70256

You can use summarise_each:

df1 %>% 
  group_by(Z) %>% 
  summarise_each(funs(sum(is.na(.))))
#Source: local data frame [5 x 4]
#
#       Z    X1    X2    X3
#  (fctr) (int) (int) (int)
#1      A   169    77   334
#2      B   170    77   316
#3      C   159    78   348
#4      D   181    79   326
#5      E   174    69   341

Note that you can specify inside summarise_each which columns to apply the function to (default is all columns except grouping columns) or which columns the function should not be applied to. It may also be interesting for you to note that like summarise_each to summarise, there's also mutate_each as the complement to mutate if you want to apply functions to all columns without summarising the result.

The obligatory data.table equivalent is:

library(data.table)
setDT(df1)[, lapply(.SD, function(x) sum(is.na(x))), by = Z]
#   Z  X1 X2  X3
#1: D 181 79 326
#2: C 159 78 348
#3: B 170 77 316
#4: A 169 77 334
#5: E 174 69 341

And in base R you could use a split/apply/combine approach like the following:

do.call(rbind,
        lapply(
          split(df1, df1$Z), function(dd) {
            colSums(is.na(dd[-1]))
          }))
#   X1 X2  X3
#A 169 77 334
#B 170 77 316
#C 159 78 348
#D 181 79 326
#E 174 69 341

Or, also in base R, you can use aggregate:

aggregate(df1[-1], list(df1$Z), FUN = function(y) sum(is.na(y))) 
aggregate(. ~ Z, df1, FUN = function(y) sum(is.na(y)), na.action = "na.pass") # formula interface

Upvotes: 5

Related Questions