Reputation: 850
I couldn't find any similar question although I doubt that this has not been posted before. My question is related to Calculate using dplyr, percentage of NA'S in each column.
In a dataset whith are multiple observations per subject, it is useful to not only calculate the total number of missing data entries (i.e. total NA's per column) but also how many subjects have missing data of some sort.
For instance, in dataset db
(see below) item_1
is missing for 2 subjects and item_2
is missing for 1 subject.
Edit 1: What I am interested in is how many subjects have (any) missing value per item. Even if in
item_2
there are two missing observations for subject number 1, this should be counted as 1 since it is still the same subject.
library("dplyr")
db <- data.frame(
subject = c(1, 1, 1, 2),
item_1 = c(NA, 2, 3, NA),
item_2 = c(1, NA, NA, 4)
)
db
#> subject item_1 item_2
#> 1 1 NA 1
#> 2 1 2 NA
#> 3 1 3 NA
#> 4 2 NA 4
My approach so far was to cbind
all single calculations into one new data.frame
but this quickly gets messy (with more columns) and is surely not well coded.
Edit 1: However, this shows the desired values, as
item_1
is missing for two subjects (1 and 2) and item_2 is only missing for 1 subject (subject 2).
cbind(
db %>%
filter(is.na(item_1)) %>%
summarise(na_item_1 = n_distinct(subject)),
db %>%
filter(is.na(item_2)) %>%
summarise(na_item_2 = n_distinct(subject))
)
#> na_item_1 na_item_2
#> 1 2 1
Question: is there an approach in dplyr
to calculate this?
Ideally, I would also like to add the proportion of missings somewhere (like in the example below):
data.frame(
type = c("n", "proportion"),
na_item_1 = c(2, 1.0),
na_item_2 = c(1, 0.5)
)
#> type na_item_1 na_item_2
#> 1 n 2.0 1.0
#> 2 proportion 1.0 0.5
Created on 2019-04-16 by the reprex package (v0.2.1)
Thanks in advance!
Upvotes: 3
Views: 174
Reputation: 388807
Another dplyr
version is to first group_by
subject
and find out the group which has any
NA
value, then group_by
column and calculate total value of NA
s for n
and divide it by total unique values of subject
to get prop
.
library(dplyr)
library(tidyr)
db %>%
group_by(subject) %>%
summarise_all(~any(is.na(.))) %>%
ungroup() %>%
select(-subject) %>%
gather() %>%
group_by(key) %>%
summarise(n = sum(value),
prop = n/n_distinct(db$subject))
# key n prop
# <chr> <int> <dbl>
#1 item_1 2 1
#2 item_2 1 0.5
Upvotes: 3
Reputation: 39858
A different tidyverse
possibility to assess the number of NA per item and per ID could be:
db %>%
gather(var, val, -subject) %>%
group_by(var, subject) %>%
summarise(val = sum(is.na(val))) %>%
spread(var, val)
subject item_1 item_2
<dbl> <int> <int>
1 1 1 2
2 2 1 0
Or if you want the overall number of NA and the proportion of NA per ID:
db %>%
gather(var, val, -subject) %>%
group_by(subject) %>%
summarise(count = sum(is.na(val)),
proportion = sum(is.na(val))/n())
subject count proportion
<dbl> <int> <dbl>
1 1 3 0.5
2 2 1 0.5
Or if you want the count and proportion of NA just per item:
bind_rows(db %>%
select(-subject) %>%
gather(var, val) %>%
group_by(var) %>%
summarise(val = sum(is.na(val))) %>%
spread(var, val) %>%
mutate(type = "count"),
db %>%
select(-subject) %>%
gather(var, val) %>%
group_by(var) %>%
summarise(val = sum(is.na(val))/n()) %>%
spread(var, val) %>%
mutate(type = "proportion"))
item_1 item_2 type
<dbl> <dbl> <chr>
1 2 2 count
2 0.5 0.5 proportion
Or if you want the number and proportion of unique subjects with NA per item:
bind_rows(db %>%
gather(var, val, -subject) %>%
filter(is.na(val)) %>%
group_by(var) %>%
summarise(val = n_distinct(subject)) %>%
spread(var, val) %>%
mutate(type = "count"),
db %>%
gather(var, val, -subject) %>%
group_by(var) %>%
mutate(n = n_distinct(subject)) %>%
filter(is.na(val)) %>%
group_by(var) %>%
summarise(val = first(n_distinct(subject)/n)) %>%
spread(var, val) %>%
mutate(type = "proportion"))
item_1 item_2 type
<dbl> <dbl> <chr>
1 2 1 count
2 1 0.5 proportion
Upvotes: 1