Math Avengers

Reputation: 792

Apply PCA to data with NA values in R

I want to apply PCA (pcomp()) to a data frame with NA values. I know PCA cannot really be applied to data containing NA values, and when I tried anyway I got the error: Error in na.fail.default(X) : missing values in object. I don't want to remove any rows because the sample size is relatively small. How can I do it?

Example:

> dput(df)
structure(list(Sample1 = 1:5, Sample2 = 11:15, Sample3 = structure(1:5, .Label = c("11", 
"12", "13", "14", "NA"), class = "factor"), Sample4 = structure(c(1L, 
1L, 4L, 2L, 3L), .Label = c("1", "4", "5", "NA"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

Upvotes: 2

Views: 6374

Answers (2)

AEP

Reputation: 160

Have a look at the missMDA package that can be used with the PCA function from the FactoMineR package (or any other function performing PCA).
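As a rough sketch of that workflow (not taken from the question's data): the example below assumes the missMDA and FactoMineR packages are installed, uses the built-in iris measurements with some artificially removed values as a stand-in data set, and fixes the number of components at 2 purely for illustration. missMDA's estim_ncpPCA() can be used to choose that number instead.

```r
library(missMDA)
library(FactoMineR)

# Stand-in data (an assumption for illustration): numeric iris columns
# with 15 values deleted at random positions.
set.seed(42)
X <- as.matrix(iris[, 1:4])
holes <- cbind(sample(nrow(X), 15), sample(ncol(X), 15, replace = TRUE))
X[holes] <- NA
X <- as.data.frame(X)

# Impute the missing cells with an iterative (regularized) PCA model.
# ncp = 2 is an arbitrary choice here; estim_ncpPCA(X) can suggest a value.
imp <- imputePCA(X, ncp = 2, scale = TRUE)
completed <- imp$completeObs

# Run an ordinary PCA on the completed data set.
res_pca <- PCA(completed, graph = FALSE)
print(res_pca$eig)
```

No rows are dropped: every observation is kept and only the missing cells are filled in before the PCA is computed.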

More information about missMDA can be found in the accompanying paper (Josse & Husson, 2016) or on the R-Miss-Tastic website (https://rmisstastic.netlify.app/lectures/).

There is also a really helpful YouTube series accompanying these packages. Here is the link to the 'Handling missing values in PCA' video explaining how the missMDA package can be used for PCA with missing data: https://www.youtube.com/watch?v=OOM8_FH6_8o&t=8s.

The fantastic thing about missMDA is that it can also be used to generate single and multiple imputation datasets for downstream analyses (for a comparison with other MI methods, see Audigier, Husson, & Josse, 2016).

Multiple imputation generates several imputed datasets, and the between-imputation variance reflects the uncertainty in the predictions of the missing entries (made with an imputation model). The missMDA package provides a way to visualize the uncertainty associated with these predictions (see this blog post: https://francoishusson.wordpress.com/2017/08/05/can-we-believe-in-the-imputations/?utm_content=bufferaaee4&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer).
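A minimal sketch of multiple imputation with missMDA, again on stand-in data rather than the question's data frame. It assumes missMDA is installed; ncp = 2 and nboot = 20 are arbitrary illustrative values.

```r
library(missMDA)

# Stand-in data (an assumption for illustration): numeric iris columns
# with 15 values deleted at random positions.
set.seed(42)
X <- as.matrix(iris[, 1:4])
holes <- cbind(sample(nrow(X), 15), sample(ncol(X), 15, replace = TRUE))
X[holes] <- NA
X <- as.data.frame(X)

# Generate 20 imputed versions of the data set.
mi <- MIPCA(X, ncp = 2, nboot = 20)

# Spread of the imputed values for one missing cell across the imputations:
# a large spread means that prediction is uncertain.
cell <- holes[1, ]
imputed_values <- sapply(mi$res.MI, function(d) d[cell[1], cell[2]])
print(sd(imputed_values))
```

Calling plot(mi) on the result should draw the uncertainty graphs discussed in the blog post.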

References

Josse, J., & Husson, F. (2016). missMDA: A package for handling missing values in multivariate data analysis. Journal of Statistical Software, 70(1), 1-31.

Audigier, V., Husson, F., & Josse, J. (2016). Multiple imputation for continuous variables using a Bayesian principal component analysis. Journal of Statistical Computation and Simulation, 86(11), 2140-2156.

Upvotes: 0

Cal

Reputation: 21

You basically have 2 options:

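  1. Impute the missing values, for example with the column mean or median, or with missMDA as described in the other answer.
  2. Use the pcaMethods R package: pca() with method = "nipals" can be run directly on data containing NAs. The package also offers other variants that tolerate missing values, such as probabilistic (ppca), Bayesian (bpca) and non-linear (nlpca) PCA.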
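Both options can be sketched on the data frame from the question. Note that its "NA" entries are factor levels (strings), so the columns need converting to numeric first. The mean-imputation part uses base R only; the NIPALS part assumes the Bioconductor package pcaMethods is installed and is skipped otherwise. nPcs = 2 and unit-variance scaling are illustrative choices.

```r
# Data from the question: Sample3 and Sample4 are factors with an "NA" level.
df <- structure(list(
  Sample1 = 1:5, Sample2 = 11:15,
  Sample3 = structure(1:5, .Label = c("11", "12", "13", "14", "NA"),
                      class = "factor"),
  Sample4 = structure(c(1L, 1L, 4L, 2L, 3L),
                      .Label = c("1", "4", "5", "NA"), class = "factor")),
  class = "data.frame", row.names = c(NA, -5L))

# Convert every column to numeric; the string "NA" becomes a real NA.
df_num <- as.data.frame(lapply(df, function(x) as.numeric(as.character(x))))

# Option 1: replace each NA with its column mean, then run a standard PCA.
df_mean <- as.data.frame(lapply(df_num, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}))
pca_mean <- prcomp(df_mean, center = TRUE, scale. = TRUE)
print(summary(pca_mean))

# Option 2: NIPALS PCA, which works on the data with the NAs left in place.
if (requireNamespace("pcaMethods", quietly = TRUE)) {
  pca_nipals <- pcaMethods::pca(as.matrix(df_num), method = "nipals",
                                nPcs = 2, scale = "uv", center = TRUE)
  print(pcaMethods::scores(pca_nipals))
}
```

Mean imputation is the simplest route, but it shrinks the variance of the imputed columns, which matters with only five rows. NIPALS avoids filling in values at all.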

I'll leave it there.

Upvotes: 2
