Jerry07

Reputation: 941

Any way to select the top n principal components from a cumulative PCA plot in R?

I want to pick the top 10 principal components from the cumulative variance plot for my dataset. I managed to produce PCA plots such as the scree plot and pairs plot, but they don't make much sense to me on their own. So I selected the top 10 components from the cumulative variance plot, but now I need to subset my original dataset using those 10 components. Can anyone point out how to make my attempt more accurate and complete?

reproducible data:

persons_df <- data.frame(person1 = sample(1:200, 20, replace = FALSE),
                         person2 = as.factor(sample(20)),
                         person3 = sample(1:250, 20, replace = FALSE),
                         person4 = sample(1:300, 20, replace = FALSE),
                         person5 = as.factor(sample(20)),
                         person6 = as.factor(sample(20)))

row.names(persons_df) <- letters[1:20]

my attempt:

# factor columns (person2, person5, person6) must be numeric before PCA,
# so convert them first, e.g. with data.matrix()
my_pca <- prcomp(t(data.matrix(persons_df)), center = TRUE, scale. = FALSE)
summary(my_pca)

# cumulative percentage of variance explained per component
my_pca_proportionvariances <- cumsum((my_pca$sdev^2 / sum(my_pca$sdev^2)) * 100)
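From here, the component scores are in my_pca$x, so pulling out the top components by column might look like this sketch (note: this toy data has only 6 columns, so fewer than 10 components exist; a real dataset with more variables would give the full 10):

# scores for up to the first 10 principal components
n_keep <- min(10, ncol(my_pca$x))
top_scores <- my_pca$x[, seq_len(n_keep), drop = FALSE]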

public dataset:

Since I ran into some issues when creating the reproducible data above, I have also linked a public example dataset.

Here I need to select the top 10 principal components for persons_df, subset the original data with them, and then run a simple linear regression on the result. How can I complete my approach to achieve this goal?

Upvotes: 1

Views: 2449

Answers (1)

alistaire

Reputation: 43334

To use PCA for dimensionality reduction, in brief:

  1. Omit your output variable (that's cheating) and create contrast variables with model.matrix if necessary. (Don't directly one-hot encode factors with lots of levels like zip code, or the size of your data will explode. Think smarter.) Remove any zero-variance variables. Deal with NAs.
  2. Scale. A variable with a large scale (like salary) can make everything else look low-variance by comparison. (A preprocessing sketch for steps 1 and 2 follows the PCA call below.)
  3. Run PCA with princomp or prcomp.
# drop the outcome (mpg is column 1), scale the predictors, and run PCA
pca <- princomp(scale(cbind(mtcars[-1])))
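For messier data than mtcars, steps 1 and 2 might look something like this sketch (df and its outcome column y are hypothetical placeholders, not from the question):

# hypothetical data frame `df` with outcome column `y`
X <- df[setdiff(names(df), "y")]    # 1. omit the output variable
X <- X[colSums(is.na(X)) == 0]      #    drop columns containing NAs (or impute them)
X <- model.matrix(~ . - 1, X)       #    expand factors into contrast variables
X <- X[, apply(X, 2, var) > 0]      #    remove zero-variance columns
X <- scale(X)                       # 2. scale, then pass X to princomp/prcomp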
  4. To get the percentage of variance explained, pull the sdev vector out of the PCA object, square it to get variance, and scale by the sum so it sums to 1.
pct_var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pct_var_explained
#>      Comp.1      Comp.2      Comp.3      Comp.4      Comp.5      Comp.6 
#> 0.576021744 0.264964319 0.059721486 0.026950667 0.022225006 0.021011744 
#>      Comp.7      Comp.8      Comp.9     Comp.10 
#> 0.013292009 0.008068158 0.005365235 0.002379633
  5. Look at the cumulative sum of variance explained to see how many principal components you want to retain. For example, components 9 and 10 explain less than 0.25% of variance here. You can also use summary to do these calculations for you.
cumsum(pct_var_explained)
#>    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
#> 0.5760217 0.8409861 0.9007075 0.9276582 0.9498832 0.9708950 0.9841870 
#>    Comp.8    Comp.9   Comp.10 
#> 0.9922551 0.9976204 1.0000000

summary(pca)
#> Importance of components:
#>                           Comp.1    Comp.2     Comp.3     Comp.4
#> Standard deviation     2.3622469 1.6021366 0.76062599 0.51096437
#> Proportion of Variance 0.5760217 0.2649643 0.05972149 0.02695067
#> Cumulative Proportion  0.5760217 0.8409861 0.90070755 0.92765822
#>                            Comp.5     Comp.6     Comp.7      Comp.8
#> Standard deviation     0.46400943 0.45116656 0.35884027 0.279571602
#> Proportion of Variance 0.02222501 0.02101174 0.01329201 0.008068158
#> Cumulative Proportion  0.94988322 0.97089497 0.98418697 0.992255132
#>                             Comp.9     Comp.10
#> Standard deviation     0.227981824 0.151831138
#> Proportion of Variance 0.005365235 0.002379633
#> Cumulative Proportion  0.997620367 1.000000000
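Since the question mentions picking from the cumulative plot specifically, here is a minimal base-R sketch of that plot with a dashed cutoff line (the 0.95 threshold is just an example):

# plot cumulative variance explained and mark a 95% cutoff
plot(cumsum(pct_var_explained), type = "b",
     xlab = "Principal component", ylab = "Cumulative proportion of variance")
abline(h = 0.95, lty = 2)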
  6. Subset to the principal components you want to keep and cbind your output variable back on.
train <- data.frame(
    mpg = mtcars$mpg, 
    predict(pca)[, cumsum(pct_var_explained) < 0.95]
)
  7. Train your model.
model <- lm(mpg ~ ., train)
summary(model)
#> 
#> Call:
#> lm(formula = mpg ~ ., data = train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -4.2581 -1.2933 -0.4999  1.3939  5.2861 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 20.09062    0.44345  45.305  < 2e-16 ***
#> Comp.1      -2.28131    0.18772 -12.153 3.17e-12 ***
#> Comp.2       0.11632    0.27679   0.420   0.6778    
#> Comp.3       1.29925    0.58301   2.229   0.0347 *  
#> Comp.4      -0.09002    0.86787  -0.104   0.9182    
#> Comp.5       0.31279    0.95569   0.327   0.7461    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.509 on 26 degrees of freedom
#> Multiple R-squared:  0.8547, Adjusted R-squared:  0.8268 
#> F-statistic: 30.59 on 5 and 26 DF,  p-value: 4.186e-10

This particular model pretty much just needs 1 principal component; there's a lot of information in there that the model can't do anything with. (Maybe it's irrelevant, redundant, or nonlinear.) Iterate.
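As one possible next iteration, refit with just the first component and compare fits:

# refit using only the first principal component
model1 <- lm(mpg ~ Comp.1, train)
summary(model1)$adj.r.squared  # compare against the fuller model's 0.8268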

Upvotes: 5
