Reputation: 13

R function to determine if missing data is related

I have a dataset with quite a bit of missing data in some columns (~20%) and am trying to figure out what proportion of these missing values occur in the same patients (e.g., are the 20% of patients missing heart rate the same 20% that are missing systolic blood pressure?). The main purpose is to determine whether data are more likely to be missing in patients with particular outcomes. I've tried the varclus function (from the Hmisc package) in R but haven't had any luck. Any suggestions and guidance are greatly appreciated, thank you! :)

Upvotes: 1

Answers (3)

Shawn Hemelstrand

Reputation: 3228

The other options here are good, but two more are missingness maps and missingness pattern matrices, from the Amelia and mice packages respectively. I demonstrate both below on the airquality dataset in R.

#### Load Libraries ####
library(mice)
library(Amelia)

#### Missingness Map ####
missmap(airquality)

#### Missingness Matrix ####
md.pattern(airquality)

The missingness map looks like this, where horizontal sweeps in white that span several columns indicate rows in which those variables are missing together. You can see here that Ozone and Solar.R are occasionally missing in the same rows, but the overlap is fairly minor:

The matrix looks like this. Each row is one missingness pattern, with the number of cases showing that pattern on the left, and each column is a variable. For example, Row 1 is the completely observed pattern (no missing data), with 111 cases. Row 2 covers the 35 cases that are missing only Ozone:
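The same overlap can also be read off as plain numbers. This is a minimal base-R sketch, assuming you only want to compare two columns at a time; the names `ozone_na`, `solar_na`, `joint`, and `overlap` are illustrative and not part of mice or Amelia.

```r
# Logical indicators: TRUE where the value is missing
ozone_na <- is.na(airquality$Ozone)
solar_na <- is.na(airquality$Solar.R)

# 2 x 2 table of joint missingness for the two columns
joint <- table(Ozone_missing = ozone_na, Solar.R_missing = solar_na)
print(joint)

# Share of the rows missing Ozone that are also missing Solar.R
overlap <- sum(ozone_na & solar_na) / sum(ozone_na)
print(overlap)
```

For the patient data in the question, the two columns would be swapped for heart rate and systolic blood pressure; an `overlap` close to 1 would mean that it is largely the same patients missing both.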

Upvotes: 0

Steffen Moritz

Reputation: 7730

The naniar R package for missing data visualization offers several easy-to-call plotting functions that are very practical for exploring your missing data. (The naniar plot gallery in the package documentation shows which plots are available.)

For example, these two plots could really help you:

1. Missingness across factors

library(naniar)

gg_miss_fct(x = riskfactors, fct = marital)

2. Combinations of missingness across cases

An UpSet plot shows which combinations of variables are missing together across cases, and how many cases fall into each combination.

gg_miss_upset(riskfactors)
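If you want the numbers behind a missingness-by-factor plot, a rough base-R equivalent is sketched below. It is not naniar's implementation: it uses the built-in airquality data with `Month` as a stand-in grouping variable, and the name `na_by_month` is made up for illustration.

```r
# For every column, the proportion of missing values within each Month.
# Rows of the result are months, columns are variables.
na_by_month <- sapply(
  airquality,
  function(col) tapply(is.na(col), airquality$Month, mean)
)

print(round(na_by_month, 2))
```

With patient data, `Month` would be replaced by the outcome column, so each row shows how much of each variable is missing for patients with that outcome.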

Upvotes: 1

zephryl

Reputation: 17069

Here's a tidyverse workflow to visualize missingness across your dataset:

library(dplyr)
library(tidyr)
library(ggplot2)

starwars %>% 
  mutate(across(everything(), is.na)) %>% 
  arrange(across(everything())) %>% 
  mutate(row = row_number()) %>% 
  pivot_longer(!row, names_to = "column", values_to = "missing") %>% 
  ggplot() +
  geom_tile(aes(row, column, fill = missing))

For starters, it looks like the same rows tend to be missing species, sex, and gender. To confirm, we can do:

starwars %>% 
  count(across(c(species, sex, gender), is.na))

#> # A tibble: 2 × 4
#>   species sex   gender     n
#>   <lgl>   <lgl> <lgl>  <int>
#> 1 FALSE   FALSE FALSE     83
#> 2 TRUE    TRUE  TRUE       4

Created on 2022-10-24 with reprex v2.0.2

This confirms that whenever any one of species, sex, or gender is missing, the other two are missing as well.
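To check every pair of columns at once instead of a hand-picked set, one option is a co-missingness matrix. The sketch below is an assumption-laden illustration in base R: it uses the built-in airquality data rather than starwars so that it runs without extra packages, and the names `na_mat`, `both_missing`, and `share` are invented here.

```r
# 1 where a value is missing, 0 otherwise
na_mat <- 1 * is.na(airquality)

# Keep only the columns that have at least one missing value
na_mat <- na_mat[, colSums(na_mat) > 0, drop = FALSE]

# Entry [i, j] = number of rows where columns i and j are both missing;
# the diagonal holds each column's total number of missing values
both_missing <- crossprod(na_mat)
print(both_missing)

# Divide each row by its diagonal entry: entry [i, j] is then the share
# of rows missing column i that are also missing column j
share <- both_missing / diag(both_missing)
print(round(share, 2))
```

Values of `share` near 1 flag pairs of variables that tend to be missing in the same rows.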

PS - the mice package has more tools for exploring missing data.

Upvotes: 4
