LucaS
LucaS

Reputation: 1303

How to simulate MAR missing data in R?

I would like to simulate some missing data in R but am having trouble. I have created two variables ('pre' and 'post') that represent a measurement for the same individual pre- and post-treatment (i.e. paired data). I have been able to do it for data that is Missing Completely at Random (MCAR) - see below, but am unable to figure out how to code it for Missing at Random (MAR). For the MAR missing data, I would like to create 3 categories based on the pre-treatment observations that will determine how many of the post-treatment observations are missing. i.e.

For pre > 25, 40% post missing
For pre > 21 and ≤ 25, 30% post missing
For pre ≤ 21, 20% post missing

Can anyone help out? (I'd be really grateful!)
Thanks

set.seed(80122)
n <- 1000

# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
data <- rmvnorm(n,mean=c(28,18),sigma=matrix(c(9,0.7*sqrt(81),0.7*sqrt(81),9),2,2)) # Covariance matrix

# Split into pre and post treatment and check correlation is what was specified
pre <- data[, 1]
post <- data[, 2]
cor.test(pre,post)

# Simulate MCAR
mcar <- 1 - rbinom(n, 1, 0.2) # Will create ~ 20% zero's which we'll convert to NA's
post_mcar <- post
post_mcar[mcar == 0] <- mcar[mcar==0] # Replace post data with random zero's from mcar vector
post_mcar[mcar == 0] <- NA # Change zero's to NAs

Upvotes: 4

Views: 2768

Answers (1)

eipi10
eipi10

Reputation: 93871

This is an old question, but I thought I'd take a crack at it.

Simulate fake data as in the OP:

library(tidyverse)
library(mvtnorm)

# Number of data values
n <- 1000

# Simulate 1000 people with high pre-treatment (mean 28, sd 3) and normal (mean 18, sd 3) post-treatment. Correlation between paired data = 0.7.
set.seed(80122)
data <- rmvnorm(n, mean=c(28,18),
                sigma=matrix(c(9,0.7*sqrt(81),0.7*sqrt(81),9),2,2)) # Covariance matrix

Convert to data frame:

data = as.data.frame(data)
names(data) = c("pre", "post")

Simulate missing completely at random (MCAR) data:

data$post_mcar <- data$post

set.seed(2)
data$post_mcar[sample(1:nrow(data), 0.2*nrow(data))] = NA

Simulate missing at random (MAR) data: First, we'll create a grouping variable, frac, whose value is the fraction of the group that we want to set to missing. We'll use the cut function to create these groups and set the label values, then we'll convert the labels to numeric for later use:

data = data %>% 
  mutate(post_mar = post,
         frac = as.numeric(as.character(cut(pre, breaks=c(-Inf, 21, 25, Inf),
                                            labels=c(0.2,0.3,0.4)))))

Now, group by frac and set a randomly selected fraction of the values to NA, using frac to determine the fraction of values set to NA.

set.seed(3)
data = data %>% 
  group_by(frac) %>% 
  mutate(post_mar=replace(post_mar, row_number(post_mar) %in% sample(1:n(), round(unique(frac)*n())), NA)) %>% 
  ungroup

Here are the last 6 rows of the resulting data frame:

          pre     post post_mcar post_mar frac
995  28.63476 19.35081  19.35081 19.35081  0.4
996  32.86278 24.16119        NA       NA  0.4
997  28.25965 16.64538  16.64538 16.64538  0.4
998  24.35255 17.80365  17.80365 17.80365  0.3
999  28.12426 18.25222  18.25222       NA  0.4
1000 27.55075 14.47757  14.47757 14.47757  0.4

Here's a check on the fraction of values missing in each group. Note that the actual percentage of values set to missing can differ from frac if the requested percentage doesn't result in an integer number of rows. Here, for example, there's no way to select 20% of 8 values. It can be 12.5% (1 value) or 25% (2 values).

data %>% group_by(frac) %>% 
  summarise(N=n(), 
            N_missing=sum(is.na(post_mar)), 
            Frac_missing=N_missing/N)
  frac   N N_missing Frac_missing
1  0.2   8         2    0.2500000
2  0.3 138        41    0.2971014
3  0.4 854       342    0.4004684

Upvotes: 2

Related Questions