Reputation: 769
I have a data set with 10 dimensions as features and 1 dimension as the cluster number (11 dimensions in total). How can I plot the PCA of my data (PC1) against the cluster number using R?
qplot(x = not_null_df$TSC_8125, y = pca, data = subset(not_null_df, select = c (not_null_df$AVG_ERTEBAT,not_null_df$AVG_ROSHD,not_null_df$AVG_HOGHOGH,not_null_df$AVG_MM,not_null_df$AVG_MK,not_null_df$AVG_TM,not_null_df$AVG_VEJHE,not_null_df$AVG_ANGIZEH,not_null_df$AVG_TAHOD)), main = "Loadings for PC1", xlab = "cluster number")
I wrote the code above and got this error:
Don't know how to automatically pick scale for object of type princomp. Defaulting to continuous.
Error: Aesthetics must be either length 1 or the same as the data (564): x, y
summary(not_null_df)
ï..QN NAMECODE GENDER VAZEYATTAAHOL TAHSILAT SEN SABEGHE
Min. : 1.00 Min. : 1.0 Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.: 28.00 1st Qu.:11.0 1st Qu.:1.000 1st Qu.:1.75 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
Median : 60.00 Median :13.0 Median :1.000 Median :2.00 Median :3.000 Median :1.000 Median :1.000
Mean : 68.63 Mean :11.7 Mean :1.152 Mean :1.75 Mean :2.578 Mean :1.394 Mean :1.121
3rd Qu.:103.25 3rd Qu.:14.0 3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:1.000
Max. :190.00 Max. :16.0 Max. :2.000 Max. :2.00 Max. :3.000 Max. :3.000 Max. :3.000
AVG_ERTEBAT AVG_ROSHD AVG_HOGHOGH AVG_MM AVG_MK AVG_TM AVG_VEJHE
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 5.333 1st Qu.: 4.125 1st Qu.: 1.750 1st Qu.: 5.000 1st Qu.: 3.125 1st Qu.: 5.981 1st Qu.: 4.556
Median : 7.000 Median : 5.875 Median : 3.500 Median : 7.727 Median : 5.000 Median : 8.000 Median : 6.333
Mean : 6.730 Mean : 5.787 Mean : 4.001 Mean : 6.903 Mean : 4.890 Mean : 7.390 Mean : 6.095
3rd Qu.: 8.425 3rd Qu.: 7.656 3rd Qu.: 6.000 3rd Qu.: 9.182 3rd Qu.: 6.688 3rd Qu.: 9.204 3rd Qu.: 7.778
Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
AVG_ANGIZEH AVG_TAHOD AVG_SOALAT TSC_8125 avg
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :1.000 Min. :0.000
1st Qu.: 5.000 1st Qu.: 5.833 1st Qu.: 4.000 1st Qu.:1.000 1st Qu.:4.788
Median : 7.000 Median : 7.667 Median : 7.000 Median :2.000 Median :6.301
Mean : 6.549 Mean : 7.171 Mean : 6.025 Mean :2.046 Mean :6.154
3rd Qu.: 8.750 3rd Qu.: 9.000 3rd Qu.: 8.000 3rd Qu.:3.000 3rd Qu.:7.599
Max. :10.000 Max. :10.000 Max. :10.000 Max. :3.000 Max. :9.978
I can compute the PCA with this code:
pca <- princomp(not_null_df, cor=TRUE, scores=TRUE)
summary(pca)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
Standard deviation 2.887437 1.28937443 1.12619079 1.08816449 0.98432226 0.91257779 0.90980017 0.82303807 0.74435256
Proportion of Variance 0.438805 0.08749929 0.06675293 0.06232116 0.05099423 0.04383149 0.04356507 0.03565219 0.02916109
Cumulative Proportion 0.438805 0.52630426 0.59305720 0.65537835 0.70637258 0.75020406 0.79376914 0.82942133 0.85858242
Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17 Comp.18
Standard deviation 0.70304085 0.67709130 0.62905993 0.59284646 0.50799135 0.48013732 0.4476952 0.39317004 0.378722707
Proportion of Variance 0.02601402 0.02412909 0.02082718 0.01849826 0.01358185 0.01213325 0.0105490 0.00813593 0.007548994
Cumulative Proportion 0.88459644 0.90872553 0.92955271 0.94805097 0.96163282 0.97376607 0.9843151 0.99245101 1.000000000
Comp.19
Standard deviation 1.838143e-08
Proportion of Variance 1.778301e-17
Cumulative Proportion 1.000000e+00
My goal is to plot the PCA (just Comp.1) against TSC_8125 (the cluster number).
Upvotes: 0
Views: 669
Reputation: 31452
The function princomp() returns a list of 7 elements. These are sdev, loadings, center, scale, n.obs, scores, and call. You can find a description of these in the function help page (which you can access by typing ?princomp). Depending on the purpose of your plot, the one of interest here is probably scores.
scores: the scores of the supplied data on the principal components.
loadings: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).
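The simplest way to access the elements of the list is via the $ operator. Thus, pca$scores or pca$loadings will access these, respectively. The scores are a plain matrix and the loadings are a matrix-like object of class "loadings"; in both, each column corresponds to a principal component (the first column is the 1st principal component, and so on). This is also why your call fails: you passed the whole princomp object as y, whereas qplot needs a numeric vector with one value per row.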
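To see this structure for yourself, here is a minimal sketch that inspects a princomp result. It uses the built-in USArrests data set as a stand-in, since your data is not available here; the name pca_demo is just for illustration.

```r
# Stand-in example on a built-in data set (USArrests: 50 rows, 4 numeric columns)
pca_demo <- princomp(USArrests, cor = TRUE, scores = TRUE)

# Names of the list elements returned by princomp()
print(names(pca_demo))

# scores has one row per observation and one column per component
print(dim(pca_demo$scores))

# The first column holds the scores on the 1st principal component
pc1_demo <- pca_demo$scores[, 1]
print(head(pc1_demo))
```

The length of pc1_demo equals the number of rows in the input, which is what a plotting function expects for a y aesthetic.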
So, to access the 1st principal component scores, you can use
comp.1 <- pca$scores[,1]
To plot this against the cluster number, you can use
plot(comp.1 ~ not_null_df$TSC_8125)
Or, if you prefer, plot it with qplot:
qplot(x = not_null_df$TSC_8125, y = comp.1, main = "Scores for PC1", xlab = "cluster number")
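If you would rather use the full ggplot() interface, something along these lines should work. This is only a sketch on made-up data: demo_df, its columns f1 to f3, and cluster are hypothetical stand-ins for not_null_df and TSC_8125. It also runs the PCA on the feature columns only, which is an assumption about what you want; your princomp() call above includes every column, the cluster number among them.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for not_null_df: three feature columns plus a cluster label
demo_df <- data.frame(
  f1 = runif(60, 0, 10),
  f2 = runif(60, 0, 10),
  f3 = runif(60, 0, 10),
  cluster = rep(1:3, each = 20)
)

# Assumption: run the PCA on the feature columns only, leaving the cluster label out
feature_cols <- c("f1", "f2", "f3")
pca_demo <- princomp(demo_df[, feature_cols], cor = TRUE, scores = TRUE)

# Collect the PC1 scores and the cluster label in one data frame for ggplot
plot_df <- data.frame(
  cluster = factor(demo_df$cluster),
  pc1 = pca_demo$scores[, 1]
)

# One box per cluster showing the distribution of PC1 scores
p <- ggplot(plot_df, aes(x = cluster, y = pc1)) +
  geom_boxplot() +
  labs(title = "Scores for PC1", x = "cluster number", y = "PC1 score")
print(p)
```

Treating the cluster number as a factor gives one box per cluster; swap geom_boxplot() for geom_point() or geom_jitter() if you want to see the individual observations.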
Upvotes: 1