Reputation: 59
In R, I generate a high-dimensional dataset with the following code and want to create incomplete versions of it under the MAR, MCAR and MNAR mechanisms, with 5%, 25% and 40% missing rates:
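library(MASS)  # mvrnorm()
library(mice)  # ampute()

generateData <- function(n, p) {
  # Covariance matrix: 1 on the diagonal, 0.3 everywhere else
  sigma <- diag(p)
  sigma <- replace(sigma, sigma == 0, 0.3)
  mu <- rep(0, nrow(sigma))
  X <- mvrnorm(n, mu = mu, Sigma = sigma)
  # Binary outcome from a logistic model with random coefficients
  vCoef <- rnorm(ncol(X))
  vProb <- exp(X %*% vCoef) / (1 + exp(X %*% vCoef))
  Y <- rbinom(nrow(X), 1, vProb)
  data <- data.frame(cbind(X, Y))
  return(data)
}
data <- generateData(n = 100, p = 120)
X <- data[-ncol(data)]  # predictors
Y <- data[ncol(data)]   # outcome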
Next I will compare the performance of imputation methods. I tried using the ampute function to generate missing datasets but when I run the code I get the following error, which I think is related to pattern and weight:
result <- ampute(X, prop = 0.4, mech ='MAR', type="RIGHT", bycases=FALSE)
Error: Proportion of missing cells is too large in combination with the desired number of missing variables
I cannot work out the right pattern and weight settings for the ampute function. I tried various pattern and weight values for MAR, MCAR and MNAR, but I got the same error each time. I also don't know whether the missing values should be created in all of the variables or only in some of them (for example, the first 50 variables). As imputation methods, I will use EM, KNN, random forests, regression-based methods, naive Bayes and artificial neural networks, as well as classical methods. Can I do this by adjusting the arguments of the ampute function, or should I use another function? Thanks in advance for your help.
Upvotes: 1
Views: 710
Reputation: 26
Since you have edited your question, I deleted my old answer (which should have been a comment instead) and will try to give a complete summary here.
I think the problem here is due to a misuse of the argument bycases. If it is set to FALSE, the prop argument defines the proportion of missing entries (cells) in your data frame. If you set prop = .4, given the dimension of your data frame (100 x 120 = 12,000 entries) and the default pattern (where each incomplete row is missing on one variable only), you are asking for 4,800 missing values, while at most 100 (one per row) are possible.
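The arithmetic behind the error can be checked in a few lines of base R. This is only a sketch: the variable names are illustrative, and it assumes the default pattern, in which each incomplete row is missing on exactly one variable.

```r
# Illustrative check (names are hypothetical, not part of mice)
n <- 100           # rows
p <- 120           # columns
prop_cells <- 0.4  # prop as interpreted when bycases = FALSE
zeros_per_pattern <- 1  # assumption: one missing variable per incomplete row

requested_cells <- prop_cells * n * p   # missing cells asked for
max_cells <- n * zeros_per_pattern      # most cells that can be made missing
implied_case_prop <- requested_cells / max_cells  # share of rows that would have to be incomplete

cat("requested:", requested_cells, " possible:", max_cells, "\n")
cat("implied proportion of incomplete rows:", implied_case_prop, "\n")
```

An implied proportion of incomplete rows above 1 cannot be satisfied, which is consistent with the error message in the question.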
If you instead define the proportion of missingness in terms of cases (bycases = TRUE, the default):
data <- generateData(n = 100, p=120)
X <- data[-ncol(data)]
Y <- data[ncol(data)]
result2 <- ampute(X, prop = 0.4)
result2$prop
#[1] 0.4
no error occurs, since you are requiring 40 observations (out of 100) to have missing values on one variable (since we are still employing the default pattern).
If you want to keep bycases = FALSE, you should either define a pattern that induces missingness on more than one variable, or set a proportion such that the number of missing values does not exceed the number of observations:
result3 <- ampute(X, prop = 0.0075, bycases = FALSE)
result3$prop
#[1] 0.9
## that is 120 x 100 x 0.0075 = 90 < 100
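More generally, if every incomplete row follows a pattern with the same number of missing variables, the largest cell-wise proportion that can work is the number of zeros per pattern divided by the number of columns. A rough sketch in base R (the helper names are made up for illustration and are not part of mice):

```r
# Hypothetical helpers, not part of mice
max_cell_prop <- function(zeros_per_pattern, p) {
  zeros_per_pattern / p
}
min_zeros_needed <- function(prop_cells, p) {
  # round() guards against floating-point noise before taking the ceiling
  ceiling(round(prop_cells * p, 8))
}

p <- 120
# Upper bound on prop with one missing variable per pattern
print(max_cell_prop(1, p))
# Zeros per pattern needed for 5%, 25% and 40% missing cells,
# even if every single row is made incomplete
print(min_zeros_needed(c(0.05, 0.25, 0.40), p))
```

Under these assumptions, the target rates from the question need patterns with several missing variables each, not just one.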
Here is a simple script that generates the kind of dataset you need.
rm(list=ls())
library(mice)
#>
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#>
#> filter
#> The following objects are masked from 'package:base':
#>
#> cbind, rbind
set.seed(221)
n <- 100
P <- 120
pstar <- 50
covmat <- toeplitz((P:1)/P)
npat <- 120
testdata <- MASS::mvrnorm(n = n, mu = rep(0, P), Sigma = covmat)
testdata <- as.data.frame(testdata)
myfreq <- .15 # or .05, .25
mypatterns <- matrix(1, nrow = npat, ncol = P)
for (i in 1:npat) {
  # each pattern sets myfreq * n of the first pstar variables to missing (0)
  idx <- sample(x = 1:pstar, size = myfreq * n, replace = FALSE)
  mypatterns[i, idx] <- 0
}
#mypatterns
result <- ampute(testdata, patterns = mypatterns)
md.pattern(result$amp)
Created on 2021-11-19 by the reprex package (v2.0.1)
Session info
sessioninfo::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────────
#> hash: person bowing, person taking bath, vampire: medium-dark skin tone
#>
#> setting value
#> version R version 4.1.0 (2021-05-18)
#> os Ubuntu 20.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate it_IT.UTF-8
#> ctype it_IT.UTF-8
#> tz Europe/Rome
#> date 2021-11-19
#> pandoc 2.11.4 @ /usr/lib/rstudio/bin/pandoc/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
#> backports 1.3.0 2021-10-27 [1] CRAN (R 4.1.0)
#> broom 0.7.10 2021-10-31 [1] CRAN (R 4.1.0)
#> cli 3.1.0 2021-10-27 [1] CRAN (R 4.1.0)
#> crayon 1.4.2 2021-10-29 [1] CRAN (R 4.1.0)
#> curl 4.3.2 2021-06-23 [1] CRAN (R 4.1.0)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
#> digest 0.6.28 2021-09-23 [1] CRAN (R 4.1.0)
#> dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
#> generics 0.1.1 2021-10-25 [1] CRAN (R 4.1.0)
#> glue 1.5.0 2021-11-07 [1] CRAN (R 4.1.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.0)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
#> knitr 1.36 2021-09-29 [1] CRAN (R 4.1.0)
#> lattice 0.20-44 2021-05-02 [4] CRAN (R 4.1.0)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.0)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
#> MASS 7.3-54 2021-05-03 [4] CRAN (R 4.0.5)
#> mice * 3.13.0 2021-01-27 [1] CRAN (R 4.1.0)
#> mime 0.12 2021-09-28 [1] CRAN (R 4.1.0)
#> pillar 1.6.4 2021-10-18 [1] CRAN (R 4.1.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.1.0)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.0)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.0)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0)
#> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0)
#> rlang 0.4.12 2021-10-18 [1] CRAN (R 4.1.0)
#> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
#> sessioninfo 1.2.1 2021-11-02 [1] CRAN (R 4.1.0)
#> stringi 1.7.5 2021-10-04 [1] CRAN (R 4.1.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
#> styler 1.6.2 2021-09-23 [1] CRAN (R 4.1.0)
#> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.0)
#> tidyr 1.1.4 2021-09-27 [1] CRAN (R 4.1.0)
#> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
#> xfun 0.28 2021-11-04 [1] CRAN (R 4.1.0)
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
#>
#> [1] /home/matt/R/x86_64-pc-linux-gnu-library/4.1
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
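For the three mechanisms in the question, the same patterns can be reused while only the mech argument of ampute changes. The following is a small sketch rather than a tested recipe for the 100 x 120 data: the toy dimensions, the three patterns and the object names are assumptions chosen to keep it quick to run, and prop is interpreted by cases (the default bycases = TRUE).

```r
library(mice)

set.seed(123)
n <- 200
p <- 10
toy <- as.data.frame(matrix(rnorm(n * p), nrow = n))  # toy complete data

# Three illustrative patterns: 0 = made missing, 1 = kept observed
pat <- matrix(1, nrow = 3, ncol = p)
pat[1, 1:2] <- 0
pat[2, 3:4] <- 0
pat[3, 5:6] <- 0

mechs <- c(MCAR = "MCAR", MAR = "MAR", MNAR = "MNAR")
amputed <- lapply(mechs, function(m) {
  ampute(toy, prop = 0.25, patterns = pat, mech = m)$amp
})

# Share of incomplete rows under each mechanism (should be roughly 0.25)
print(sapply(amputed, function(d) mean(!complete.cases(d))))
```

Each element of amputed is a data frame with NA values in the columns flagged by the patterns, so it can be passed to whichever imputation method is being compared.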
Upvotes: 1
Reputation: 59
I think you should provide a MWE in order to reproduce your error. Since I don't know what mypattern is like, I can't help you!
Thanks for the answer! Even though I tried different patterns, I got the same error every time. For example, I tried a pattern like the one below to create missing data in the first 50 variables:
############### PATTERN ###########################
a <- ncol(X)
b <- 50
mypattern <- matrix(1, nrow = b, ncol = a)
for (i in 1:b) {
  mypattern[i, i] <- 0
}
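The same pattern matrix can be built without a loop, which makes it easy to check how many variables each pattern sets to missing. A minimal sketch in base R (a is hard-coded to 120 here so the snippet runs on its own; in the code above it is ncol(X)):

```r
a <- 120  # number of variables (ncol(X) in the code above)
b <- 50   # number of patterns

# Same matrix as the loop: pattern i has a 0 in column i only
mypattern <- matrix(1, nrow = b, ncol = a)
mypattern[cbind(seq_len(b), seq_len(b))] <- 0

# Number of variables set to missing by each pattern
print(table(rowSums(mypattern == 0)))
# Rough upper bound on prop when bycases = FALSE and each pattern has one zero
print(1 / a)
```

Because every pattern still contains a single zero, this pattern appears to hit the same limit as the default one when prop = 0.4 and bycases = FALSE.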
The problem also persisted when I used the default pattern. I have updated my code above to use the default pattern.
Upvotes: 0