How can I generate missing data structures to run simulations on high dimensional data in R?

Question

In R, I generate a high-dimensional dataset with the following code, and I want to create incomplete versions of it under the MAR, MCAR and MNAR mechanisms, with 5%, 25% and 40% missing rates:

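library(MASS)  # for mvrnorm()

generateData <- function(n, p) {
  # Covariance matrix: unit variances, 0.3 for every pairwise covariance
  sigma <- diag(p)
  sigma <- replace(sigma, sigma == 0, 0.3)
  mu <- rep(0, nrow(sigma))
  X <- mvrnorm(n, mu = mu, Sigma = sigma)
  # Binary outcome from a logistic model with random coefficients
  vCoef <- rnorm(ncol(X))
  vProb <- exp(X %*% vCoef) / (1 + exp(X %*% vCoef))
  Y <- rbinom(nrow(X), 1, vProb)
  data <- data.frame(cbind(X, Y))
  return(data)
}
data <- generateData(n = 100, p = 120)
X <- data[-ncol(data)]  # predictors
Y <- data[ncol(data)]   # outcome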

Next I will compare the performance of imputation methods. I tried using the ampute function to generate the incomplete datasets, but when I run the code I get the following error, which I think is related to the patterns and weights:

result <- ampute(X, prop = 0.4, mech ='MAR', type="RIGHT", bycases=FALSE)
Error: Proportion of missing cells is too large in combination with the desired number of missing variables

With the ampute function, I cannot work out the necessary settings for the patterns and weights. I tried various pattern and weight values for MAR, MCAR and MNAR, but none of them worked. Also, I don't know whether the missing values should be generated in all of the variables or only in some of them (for example, the first 50 variables). As imputation methods, I will use EM, KNN, random forests, regression-based methods, naive Bayes and artificial neural networks, as well as classical methods. Can I do this by adjusting the arguments of the ampute function, or should I use another function? Thanks in advance for your help.

Matteo Pedone · Accepted Answer

Since you’re editing some missing cross-references, I deleted my old answer (which should have been a comment instead) and will try to give a complete summary here.

I think the problem here is a misuse of the argument bycases. If it is set to FALSE, the prop argument defines the proportion of missing entries (cells) in your data frame rather than the proportion of incomplete cases. With prop = .4, the dimension of your data frame (100 x 120 = 12,000 entries) and the default patterns (each of which puts missingness on one variable only, so an incomplete case has exactly one missing value), you are asking for 4,800 missing values when at most 100 are possible (one per case).
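The arithmetic behind the error can be checked in a few lines of base R. This is only a sketch of the counting argument, assuming the dimensions of X from the question and one missing value per incomplete case; the variable names are illustrative and not part of mice.

```r
# Dimensions of X from the question
n    <- 100   # cases (rows)
p    <- 120   # variables (columns)
prop <- 0.4   # requested proportion of missing cells (bycases = FALSE)

total_cells     <- n * p                # all entries of the data frame
requested_cells <- prop * total_cells   # missing entries implied by prop

# Assumption: single-variable patterns, i.e. one missing value per incomplete case
max_cells <- n * 1

# TRUE means the request cannot be satisfied
requested_cells > max_cells
```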

If you consider the proportion of missingness to be defined in terms of cases (the default, bycases = TRUE):

data <- generateData(n = 100, p=120)
X <- data[-ncol(data)]
Y <- data[ncol(data)]

result2 <- ampute(X, prop = 0.4)
result2$prop
#[1] 0.4

no error occurs, since you are requiring 40 observations (out of 100) to have a missing value on one variable each (we are still employing the default patterns).

If you want to use bycases = FALSE, you should either define patterns that induce missingness on more than one variable, or set a proportion such that the total number of missing values does not exceed the number of observations:

result3 <- ampute(X, prop = 0.0075, bycases = FALSE)
result3$prop
#[1] 0.9

## that is, 120 x 100 x 0.0075 = 90 missing cells <= 100 cases
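To see in advance which cell-level proportions are feasible for a given patterns matrix, the same counting argument can be wrapped in two small helpers. This is a rough sketch: max_cell_prop() and cell_to_case_prop() are hypothetical names, not mice functions, and the code assumes that freq holds the relative frequencies of the patterns among the incomplete cases.

```r
# Hypothetical helpers (not part of mice).
# patterns: 0/1 matrix, one row per pattern, 0 = variable set to missing.
max_cell_prop <- function(patterns,
                          freq = rep(1 / nrow(patterns), nrow(patterns))) {
  # Average number of missing variables per incomplete case
  k <- sum(freq * rowSums(patterns == 0))
  # Largest cell-level proportion, reached when every case is incomplete
  k / ncol(patterns)
}

cell_to_case_prop <- function(cell_prop, patterns,
                              freq = rep(1 / nrow(patterns), nrow(patterns))) {
  # Proportion of incomplete cases implied by a cell-level proportion
  cell_prop / max_cell_prop(patterns, freq)
}

# Single-variable patterns for 120 variables (same shape as the default)
single_var <- 1 - diag(120)

max_cell_prop(single_var)               # upper bound for prop, about 0.0083
cell_to_case_prop(0.0075, single_var)   # about 0.9, as in result3$prop
cell_to_case_prop(0.4, single_var)      # far above 1, hence the error
```

Patterns with more zeros per row raise the upper bound, which is why defining patterns with several missing variables makes larger cell-level proportions attainable.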

Here I report a simple script to generate the dataset you need.

rm(list=ls())
library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
set.seed(221)

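n <- 100      # number of observations
P <- 120      # number of variables
pstar <- 50   # only the first pstar variables can have missing values
covmat <- toeplitz((P:1)/P)
npat <- 120   # number of missing data patterns

testdata <- MASS::mvrnorm(n = n, mu = rep(0, P), Sigma = covmat)
testdata <- as.data.frame(testdata)

myfreq <- .15 # or .05, .25
# Each pattern (row) sets myfreq * n = 15 of the first pstar variables to missing (0)
mypatterns <- matrix(1, nrow = npat, ncol = P)
for(i in 1:npat){
  idx <- sample(x = 1:pstar, size = myfreq * n, replace = FALSE)
  mypatterns[i,idx] <- 0
}
#mypatterns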

result <- ampute(testdata, patterns = mypatterns)
md.pattern(result$amp)

Created on 2021-11-19 by the reprex package (v2.0.1)
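Building on the script above, one possible way to cover all the combinations from the question is to loop over the three mechanisms and the three rates. This is a sketch under a few assumptions: prop is interpreted by cases (the default bycases = TRUE), so it is the share of incomplete rows and not of missing cells; the number of patterns (5) and the 15 variables per pattern are illustrative choices; and the objects X, patterns, grid and amputed are names introduced here.

```r
library(mice)  # provides ampute()

set.seed(221)
n <- 100      # observations
P <- 120      # variables
pstar <- 50   # only the first pstar variables may become missing

X <- as.data.frame(
  MASS::mvrnorm(n = n, mu = rep(0, P), Sigma = toeplitz((P:1) / P))
)

# Illustrative patterns: each row sets 15 of the first pstar variables to 0 (missing)
npat <- 5
patterns <- matrix(1, nrow = npat, ncol = P)
for (i in seq_len(npat)) {
  patterns[i, sample(seq_len(pstar), size = 15)] <- 0
}

# All combinations of mechanism and proportion of incomplete cases
grid <- expand.grid(mech = c("MCAR", "MAR", "MNAR"),
                    prop = c(0.05, 0.25, 0.40),
                    stringsAsFactors = FALSE)

# One amputed data frame per combination
amputed <- lapply(seq_len(nrow(grid)), function(j) {
  ampute(X, prop = grid$prop[j], mech = grid$mech[j], patterns = patterns)$amp
})
names(amputed) <- paste(grid$mech, grid$prop, sep = "_")

# Realized share of incomplete rows; it varies around the requested prop
grid$incomplete_rows <- sapply(amputed, function(d) mean(!complete.cases(d)))
grid
```

Each element of amputed can then be passed to the imputation methods being compared, with X serving as the complete-data reference. If the 5%, 25% and 40% rates are meant at the cell level instead, prop has to be rescaled by the share of variables set to missing in each pattern.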

Session info

sessioninfo::session_info()
#> ─ Session info  ──────────────────────────────────────────────────────────────
#>  hash: person bowing, person taking bath, vampire: medium-dark skin tone
#> 
#>  setting  value
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Ubuntu 20.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  it_IT.UTF-8
#>  ctype    it_IT.UTF-8
#>  tz       Europe/Rome
#>  date     2021-11-19
#>  pandoc   2.11.4 @ /usr/lib/rstudio/bin/pandoc/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.3.0   2021-10-27 [1] CRAN (R 4.1.0)
#>  broom         0.7.10  2021-10-31 [1] CRAN (R 4.1.0)
#>  cli           3.1.0   2021-10-27 [1] CRAN (R 4.1.0)
#>  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.0)
#>  curl          4.3.2   2021-06-23 [1] CRAN (R 4.1.0)
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
#>  digest        0.6.28  2021-09-23 [1] CRAN (R 4.1.0)
#>  dplyr         1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
#>  generics      0.1.1   2021-10-25 [1] CRAN (R 4.1.0)
#>  glue          1.5.0   2021-11-07 [1] CRAN (R 4.1.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.0)
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
#>  knitr         1.36    2021-09-29 [1] CRAN (R 4.1.0)
#>  lattice       0.20-44 2021-05-02 [4] CRAN (R 4.1.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.0)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  MASS          7.3-54  2021-05-03 [4] CRAN (R 4.0.5)
#>  mice        * 3.13.0  2021-01-27 [1] CRAN (R 4.1.0)
#>  mime          0.12    2021-09-28 [1] CRAN (R 4.1.0)
#>  pillar        1.6.4   2021-10-18 [1] CRAN (R 4.1.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
#>  Rcpp          1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang         0.4.12  2021-10-18 [1] CRAN (R 4.1.0)
#>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.2.1   2021-11-02 [1] CRAN (R 4.1.0)
#>  stringi       1.7.5   2021-10-04 [1] CRAN (R 4.1.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.0)
#>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.0)
#>  tidyr         1.1.4   2021-09-27 [1] CRAN (R 4.1.0)
#>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun          0.28    2021-11-04 [1] CRAN (R 4.1.0)
#>  xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#> 
#>  [1] /home/matt/R/x86_64-pc-linux-gnu-library/4.1
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
