Reputation: 13
I have a dataset with quite a bit of missing data in some columns (~20%), and I am trying to figure out what proportion of the missing values occur in the same patients (e.g. are the 20% of patients missing heart rate the same 20% that are missing systolic blood pressure?). The main purpose is to determine whether data are more likely to be missing in patients with particular outcomes. I've tried to use the varclus function (from the Hmisc package) in R, but I haven't had any luck. Any suggestions and guidance are greatly appreciated, thank you! :)
Upvotes: 1
Views: 977
Reputation: 3228
The other options here are good, but a couple of additional options are the missingness map from the Amelia package and the missingness pattern matrix from the mice package. I demonstrate both below on the airquality dataset in R.
#### Load Libraries ####
library(mice)
library(Amelia)
#### Missingness Map ####
missmap(airquality)
#### Missingness Matrix ####
md.pattern(airquality)
The missingness map looks like this, where the horizontal sweeps in white indicate some relationship between missing values. You can see here that ozone and solar radiation have some related missingness at times, but it is fairly minor:
The matrix looks like this, and it shows the missingness patterns by row and the variables by column. For example, Row 1 is the fully observed pattern (no missing data), which covers 111 rows of the dataset. Row 2 is a pattern in which only ozone is missing, which covers 35 rows:
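If you want the overlap as plain numbers rather than a plot, a cross-tabulation of two missingness indicators gives the same counts. This is only a minimal base R sketch on the same airquality data; the object names (`miss`, `overlap`, `share`) are my own and not part of mice or Amelia.

```r
# Base R only; airquality ships with R
miss <- is.na(airquality[c("Ozone", "Solar.R")])

# Rows: is Ozone missing? Columns: is Solar.R missing?
overlap <- table(Ozone = miss[, "Ozone"], Solar.R = miss[, "Solar.R"])
print(overlap)

# Share of the Ozone-missing rows that are also missing Solar.R
share <- overlap["TRUE", "TRUE"] / sum(miss[, "Ozone"])
print(share)
```

The "TRUE"/"TRUE" cell is the number of rows missing both variables, so dividing it by the number of rows missing one variable answers the "are they the same patients?" question for that pair.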
Upvotes: 0
Reputation: 7730
The naniar R package for missing data visualization offers several easy-to-call plotting functions that are very practical for exploring your missing data. (The plot gallery in the naniar package documentation shows which plots are available.)
For example these two plots could really help you:
1. Missingness across factors
gg_miss_fct(x = riskfactors, fct = marital)
2. Combinations of missingness across cases
Upset plot for combinations of missingness across cases (see combinations of missingness and intersections of missingness amongst variables).
gg_miss_upset(riskfactors)
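The "missingness across factors" idea can also be checked numerically. Below is a rough base R sketch, not naniar's own method: it uses the built-in airquality data instead of riskfactors, and it treats Month as the grouping factor purely as a stand-in for an outcome variable. The object names are hypothetical.

```r
# Proportion of missing Ozone readings within each Month
ozone_missing_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, mean)
print(round(ozone_missing_by_month, 2))

# Same idea for every column at once:
# rows are variables, columns are months, cells are proportions missing
miss_by_month <- sapply(
  split(airquality, airquality$Month),
  function(d) colMeans(is.na(d))
)
print(round(miss_by_month, 2))
```

With your own data you would replace Month with the outcome column, and large differences between groups would suggest that missingness depends on the outcome.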
Upvotes: 1
Reputation: 17069
Here's a tidyverse workflow to visualize missingness across your dataset:
library(dplyr)
library(tidyr)
library(ggplot2)
starwars %>%
  mutate(across(everything(), is.na)) %>%
  arrange(across(everything())) %>%
  mutate(row = row_number()) %>%
  pivot_longer(!row, names_to = "column", values_to = "missing") %>%
  ggplot() +
  geom_tile(aes(row, column, fill = missing))
For starters, it looks like the same rows tend to be missing species, sex, and gender. To confirm, we can do:
starwars %>%
  count(across(c(species, sex, gender), is.na))
#> # A tibble: 2 × 4
#>   species sex   gender     n
#>   <lgl>   <lgl> <lgl>  <int>
#> 1 FALSE   FALSE FALSE     83
#> 2 TRUE    TRUE  TRUE       4
Created on 2022-10-24 with reprex v2.0.2
This confirms that whenever any one of species, sex, or gender is missing, the other two are missing as well.
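To tie this back to the original question, here is a small base R sketch of the same check on a made-up patient table. Everything in it is an assumption for illustration: the data frame `patients`, its columns (`heart_rate`, `sbp`, `died`), and the values are hypothetical, not from the question.

```r
# Hypothetical patient data; all names and values are made up for illustration
patients <- data.frame(
  heart_rate = c(72, NA, 88, NA, 65, 90, NA, 70, 81, 77),
  sbp        = c(120, NA, 135, NA, 110, NA, 128, 118, 140, 125),
  died       = c(0, 1, 0, 1, 0, 0, 1, 0, 0, 0)
)

# 0/1 indicator matrix of missingness for the two vitals
m <- is.na(patients[c("heart_rate", "sbp")]) * 1

# Diagonal: missing count per column; off-diagonal: rows missing both
co_missing <- crossprod(m)
print(co_missing)

# Fraction of heart_rate-missing patients who are also missing sbp
share_overlap <- co_missing["heart_rate", "sbp"] /
  co_missing["heart_rate", "heart_rate"]

# Proportion missing heart_rate within each outcome group
hr_missing_by_outcome <- tapply(is.na(patients$heart_rate), patients$died, mean)
print(hr_missing_by_outcome)
```

The cross-product scales to any number of columns, so it gives every pairwise overlap in one step, while the per-outcome proportions give a first look at whether missingness is related to the outcome.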
PS - the mice package has more tools for exploring missing data.
Upvotes: 4