Frederick
Frederick

Reputation: 850

Calculate column NA's based on a grouping variable with dplyr

I couldn't find any similar question although I doubt that this has not been posted before. My question is related to Calculate using dplyr, percentage of NA'S in each column.


In a dataset whith are multiple observations per subject, it is useful to not only calculate the total number of missing data entries (i.e. total NA's per column) but also how many subjects have missing data of some sort.

For instance, in dataset db (see below) item_1 is missing for 2 subjects and item_2 is missing for 1 subject.

Edit 1: What I am interested in is how many subjects have (any) missing value per item. Even if in item_2 there are two missing observations for subject number 1, this should be counted as 1 since it is still the same subject.

library("dplyr")

db <- data.frame(
  subject = c(1, 1, 1, 2),
  item_1 = c(NA, 2, 3, NA),
  item_2 = c(1, NA, NA, 4)
)
db
#>   subject item_1 item_2
#> 1       1     NA      1
#> 2       1      2     NA
#> 3       1      3     NA
#> 4       2     NA      4

My approach so far was to cbind all single calculations into one new data.frame but this quickly gets messy (with more columns) and is surely not well coded.

Edit 1: However, this shows the desired values, as item_1 is missing for two subjects (1 and 2) and item_2 is only missing for 1 subject (subject 2).

cbind(
  db %>%
    filter(is.na(item_1)) %>%
    summarise(na_item_1 = n_distinct(subject)),
  db %>%
    filter(is.na(item_2)) %>%
    summarise(na_item_2 = n_distinct(subject))
)
#>   na_item_1 na_item_2
#> 1         2         1

Question: is there an approach in dplyr to calculate this?

Ideally, I would also like to add the proportion of missings somewhere (like in the example below):

data.frame(
  type = c("n", "proportion"),
  na_item_1 = c(2, 1.0),
  na_item_2 = c(1, 0.5)
)
#>         type na_item_1 na_item_2
#> 1          n       2.0       1.0
#> 2 proportion       1.0       0.5

Created on 2019-04-16 by the reprex package (v0.2.1)

Thanks in advance!

Upvotes: 3

Views: 174

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388807

Another dplyr version is to first group_by subject and find out the group which has any NA value, then group_by column and calculate total value of NAs for n and divide it by total unique values of subject to get prop.

library(dplyr)
library(tidyr)

db %>%
  group_by(subject) %>%
  summarise_all(~any(is.na(.))) %>%
  ungroup() %>%
  select(-subject) %>%
  gather() %>%
  group_by(key) %>%
  summarise(n = sum(value), 
            prop = n/n_distinct(db$subject))

#   key       n  prop
#   <chr>  <int> <dbl>
#1 item_1     2   1  
#2 item_2     1   0.5

Upvotes: 3

tmfmnk
tmfmnk

Reputation: 39858

A different tidyverse possibility to assess the number of NA per item and per ID could be:

db %>%
 gather(var, val, -subject) %>%
 group_by(var, subject) %>%
 summarise(val = sum(is.na(val))) %>%
 spread(var, val)

  subject item_1 item_2
    <dbl>  <int>  <int>
1       1      1      2
2       2      1      0

Or if you want the overall number of NA and the proportion of NA per ID:

db %>%
 gather(var, val, -subject) %>%
 group_by(subject) %>%
 summarise(count = sum(is.na(val)),
           proportion = sum(is.na(val))/n())

  subject count proportion
    <dbl> <int>      <dbl>
1       1     3        0.5
2       2     1        0.5

Or if you want the count and proportion of NA just per item:

bind_rows(db %>%
 select(-subject) %>%
 gather(var, val) %>%
 group_by(var) %>%
 summarise(val = sum(is.na(val))) %>%
 spread(var, val) %>%
 mutate(type = "count"),
 db %>%
 select(-subject) %>%
 gather(var, val) %>%
 group_by(var) %>%
 summarise(val = sum(is.na(val))/n()) %>%
 spread(var, val) %>%
 mutate(type = "proportion"))

  item_1 item_2 type      
   <dbl>  <dbl> <chr>     
1    2      2   count     
2    0.5    0.5 proportion

Or if you want the number and proportion of unique subjects with NA per item:

bind_rows(db %>%
 gather(var, val, -subject) %>%
 filter(is.na(val)) %>%
 group_by(var) %>%
 summarise(val = n_distinct(subject)) %>%
 spread(var, val) %>%
 mutate(type = "count"),
 db %>%
 gather(var, val, -subject) %>%
 group_by(var) %>%
 mutate(n = n_distinct(subject)) %>%
 filter(is.na(val)) %>%
 group_by(var) %>%
 summarise(val = first(n_distinct(subject)/n)) %>%
 spread(var, val) %>%
 mutate(type = "proportion"))

  item_1 item_2 type      
   <dbl>  <dbl> <chr>     
1      2    1   count     
2      1    0.5 proportion

Upvotes: 1

Related Questions