aliakbarian

Reputation: 769

Plot PCA vs one dimension in R

I have a data set with 10 feature dimensions and 1 dimension holding the cluster number (11 dimensions in total). How can I plot the first principal component (PC1) of my data against the cluster number in R?

qplot(x = not_null_df$TSC_8125, y =  pca, data = subset(not_null_df, select = c (not_null_df$AVG_ERTEBAT,not_null_df$AVG_ROSHD,not_null_df$AVG_HOGHOGH,not_null_df$AVG_MM,not_null_df$AVG_MK,not_null_df$AVG_TM,not_null_df$AVG_VEJHE,not_null_df$AVG_ANGIZEH,not_null_df$AVG_TAHOD)), main = "Loadings for PC1", xlab = "cluster number")

I wrote the code above, but it gives this error:

Don't know how to automatically pick scale for object of type princomp. Defaulting to continuous.
Error: Aesthetics must be either length 1 or the same as the data (564): x, y

summary(not_null_df)
     ï..QN           NAMECODE        GENDER      VAZEYATTAAHOL     TAHSILAT          SEN           SABEGHE     
 Min.   :  1.00   Min.   : 1.0   Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.: 28.00   1st Qu.:11.0   1st Qu.:1.000   1st Qu.:1.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
 Median : 60.00   Median :13.0   Median :1.000   Median :2.00   Median :3.000   Median :1.000   Median :1.000  
 Mean   : 68.63   Mean   :11.7   Mean   :1.152   Mean   :1.75   Mean   :2.578   Mean   :1.394   Mean   :1.121  
 3rd Qu.:103.25   3rd Qu.:14.0   3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:1.000  
 Max.   :190.00   Max.   :16.0   Max.   :2.000   Max.   :2.00   Max.   :3.000   Max.   :3.000   Max.   :3.000  
  AVG_ERTEBAT       AVG_ROSHD       AVG_HOGHOGH         AVG_MM           AVG_MK           AVG_TM         AVG_VEJHE     
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 5.333   1st Qu.: 4.125   1st Qu.: 1.750   1st Qu.: 5.000   1st Qu.: 3.125   1st Qu.: 5.981   1st Qu.: 4.556  
 Median : 7.000   Median : 5.875   Median : 3.500   Median : 7.727   Median : 5.000   Median : 8.000   Median : 6.333  
 Mean   : 6.730   Mean   : 5.787   Mean   : 4.001   Mean   : 6.903   Mean   : 4.890   Mean   : 7.390   Mean   : 6.095  
 3rd Qu.: 8.425   3rd Qu.: 7.656   3rd Qu.: 6.000   3rd Qu.: 9.182   3rd Qu.: 6.688   3rd Qu.: 9.204   3rd Qu.: 7.778  
 Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
  AVG_ANGIZEH       AVG_TAHOD        AVG_SOALAT        TSC_8125          avg       
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :1.000   Min.   :0.000  
 1st Qu.: 5.000   1st Qu.: 5.833   1st Qu.: 4.000   1st Qu.:1.000   1st Qu.:4.788  
 Median : 7.000   Median : 7.667   Median : 7.000   Median :2.000   Median :6.301  
 Mean   : 6.549   Mean   : 7.171   Mean   : 6.025   Mean   :2.046   Mean   :6.154  
 3rd Qu.: 8.750   3rd Qu.: 9.000   3rd Qu.: 8.000   3rd Qu.:3.000   3rd Qu.:7.599  
 Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :3.000   Max.   :9.978  

I can compute the PCA with this code:

pca <- princomp(not_null_df, cor=TRUE, scores=TRUE)

summary(pca)
Importance of components:
                         Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7     Comp.8     Comp.9
Standard deviation     2.887437 1.28937443 1.12619079 1.08816449 0.98432226 0.91257779 0.90980017 0.82303807 0.74435256
Proportion of Variance 0.438805 0.08749929 0.06675293 0.06232116 0.05099423 0.04383149 0.04356507 0.03565219 0.02916109
Cumulative Proportion  0.438805 0.52630426 0.59305720 0.65537835 0.70637258 0.75020406 0.79376914 0.82942133 0.85858242
                          Comp.10    Comp.11    Comp.12    Comp.13    Comp.14    Comp.15   Comp.16    Comp.17     Comp.18
Standard deviation     0.70304085 0.67709130 0.62905993 0.59284646 0.50799135 0.48013732 0.4476952 0.39317004 0.378722707
Proportion of Variance 0.02601402 0.02412909 0.02082718 0.01849826 0.01358185 0.01213325 0.0105490 0.00813593 0.007548994
Cumulative Proportion  0.88459644 0.90872553 0.92955271 0.94805097 0.96163282 0.97376607 0.9843151 0.99245101 1.000000000
                            Comp.19
Standard deviation     1.838143e-08
Proportion of Variance 1.778301e-17
Cumulative Proportion  1.000000e+00

My goal is to plot the PCA scores (just Comp.1) against TSC_8125, which is the cluster number.

Upvotes: 0

Views: 669

Answers (1)

dww

Reputation: 31452

The function princomp() returns a list of 7 elements. These are sdev, loadings, center, scale, n.obs, scores, and call. You can find a description of these in the function help page (which you can access by typing ?princomp). Depending on the purpose of your plot, the one of interest here is probably scores.

scores: the scores of the supplied data on the principal components.

loadings: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).

The simplest way to access the elements of the list is via the $ operator. Thus, pca$scores or pca$loadings will access these, respectively. The scores are a plain matrix and the loadings are a matrix with class "loadings"; in both, each column corresponds to a principal component (the first column is the 1st principal component, and so on).
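As a quick check of that structure, here is a minimal sketch that runs princomp() on the built-in USArrests data. The data set and the name pca_demo are stand-ins chosen for illustration; they are not the asker's not_null_df.

```r
# Illustration on the built-in USArrests data (an assumption, not the asker's data).
pca_demo <- princomp(USArrests, cor = TRUE, scores = TRUE)

# Names of the list elements returned by princomp()
print(names(pca_demo))

# Scores: one row per observation, one column per principal component
print(dim(pca_demo$scores))

# Loadings: one row per input variable, one column per principal component
print(dim(pca_demo$loadings))
```

The row count of the scores matrix matches the number of rows in the input, which is what makes a single column of it usable as a plotting variable.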

So, to access the 1st principal component scores, you can use

comp.1 <- pca$scores[,1]
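This also suggests why the original call failed: y = pca passed the whole princomp object rather than a numeric vector with one value per row. A small sketch of the difference, again on the built-in USArrests data (a stand-in, with pca_demo and pc1_demo as hypothetical names):

```r
# Illustration on the built-in USArrests data (an assumption, not the asker's data).
pca_demo <- princomp(USArrests, cor = TRUE, scores = TRUE)

# The fitted object is a list of class "princomp", not a numeric vector
print(class(pca_demo))
print(is.numeric(pca_demo))

# Its length is the number of list elements, not the number of observations
print(length(pca_demo))

# The first column of the scores matrix is a numeric vector with one value per row
pc1_demo <- pca_demo$scores[, 1]
print(is.numeric(pc1_demo))
print(length(pc1_demo) == nrow(USArrests))
```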

To plot this against the cluster number, you can use

plot(comp.1 ~ not_null_df$TSC_8125)

or, if you prefer qplot:

qplot(x = not_null_df$TSC_8125, y =  comp.1, main = "Scores for PC1", xlab = "cluster number")
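For a version that can be run end to end without the original data, the sketch below uses iris with k-means labels as a hypothetical stand-in for not_null_df and TSC_8125. Two choices here are assumptions rather than part of the approach above: the PCA is fitted on the feature columns only, so the cluster label does not contribute to PC1, and the cluster number is converted to a factor so that base R draws one box per cluster instead of a scatter plot.

```r
# Hypothetical stand-in data: iris measurements plus a k-means cluster label.
set.seed(1)
features <- iris[, 1:4]
cluster  <- kmeans(scale(features), centers = 3)$cluster

# PCA on the feature columns only (an assumption; the question used every column)
pca_demo <- princomp(features, cor = TRUE, scores = TRUE)
pc1 <- pca_demo$scores[, 1]

# Both vectors must have one entry per row for any of the plotting calls to work
stopifnot(length(pc1) == length(cluster))

# With a factor on the right-hand side, the formula draws one box per cluster
boxplot(pc1 ~ factor(cluster),
        xlab = "cluster number", ylab = "PC1 score",
        main = "Scores for PC1")
```

With the asker's data, the same pattern would apply by selecting the AVG_ feature columns for princomp() and using not_null_df$TSC_8125 as the grouping variable.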

Upvotes: 1
