Reputation: 149

How to plot distance biplot and correlation biplot results of SVD/PCA in R?

I searched for a long time for a straightforward explanation of the distance vs correlation biplots, as well as an explanation of how to transform the standard outputs of PCA to achieve the two biplots. All the stack overflow explanations 1 2 3 4 I saw went way over my head with math terms. How can I create both a distance biplot and a correlation biplot using the outputs of R's prcomp?

Upvotes: 1

Answers (1)

mikeblazanin

Reputation: 149

The best explanation I found is some lecture slides from Pierre Legendre, Département de sciences biologiques, Université de Montréal (http://biol09.biol.umontreal.ca/PLcourses/Ordination_section_1.1_PCA_Eng.pdf). However, while these slides did show the way to plot a distance and correlation biplot manually, they didn't show how to plot the distance and correlation biplots from the results of prcomp.

So I worked through an example that shows how one can use the outputs of prcomp for them to be equivalent to the example walked through in the pdf above. I am leaving this here for future people like myself who are wondering how to plot a distance vs correlation biplot and when you want to use each (according to Pierre Legendre)

set.seed(1)

#Run standard PCA
pca_res <- prcomp(mtcars[, 1:7], center = TRUE, scale = TRUE, retx = TRUE)

#To print a distance biplot, simply plot pca_red$x as points and $rotation
#as vectors
library(ggplot2)

arrow_len <- 3 #arbitrary scaling of arrows so they're same mag as PC scores
ggplot(data = as.data.frame(pca_res$x), aes(x = PC1, y = PC2)) +
  geom_point() +
  geom_segment(data = as.data.frame(pca_res$rotation),
                    aes(x = 0, y = 0, yend = arrow_len*PC1, xend = arrow_len*PC2),
                    arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(data = as.data.frame(pca_res$rotation),
            mapping = aes(y = arrow_len*PC1, x = arrow_len*PC2,
                label = row.names(pca_res$rotation)))

#This is equivalent to the following steps:
Y_centered <- scale(mtcars[, 1:7], center = TRUE, scale = TRUE)
Y_eig <- eigen(cov(Y_centered)) 
#Note that Y_eig$vectors == pca_res$rotation ("rotations" or "loadings")
# and Y_eig$values (eigenvalues) == pca_res$sdev**2

#For a distance biplot
U_frame <- Y_eig$vectors
#F is your PC scores, achieved by multiplying your original data by the rotations
F_frame <- Y_centered %*% U_frame

#flipping constants if needed bc PC axis direction is arbitrary
x_flip = -1
y_flip = -1
ggplot(data = as.data.frame(F_frame), aes(x = x_flip*V1, y = y_flip*V2)) +
  geom_point() +
  geom_segment(data = as.data.frame(U_frame),
               aes(x = 0, y = 0, yend = y_flip*arrow_len*V1, xend = x_flip*arrow_len*V2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(data = as.data.frame(U_frame),
            mapping = aes(y = y_flip*arrow_len*V1, x = x_flip*arrow_len*V2,
                          label = colnames(Y_centered)))

#To print a correlation biplot, matrix multiply your rotations/loadings
# by the identity matrix times your PCA standard deviations 
# (equivalent to the sqrt of your eigen values)
U_frame_scaling2 <- U_frame %*% diag(Y_eig$values^(0.5))

#And divide your PC scores by your PCA standard deviations
# (equivalent to 1/sqrt(eigen values)
F_frame_scaling2 <- F_frame %*% diag(Y_eig$values^(-0.5))

#Plot
arrow_len <- 1.5 #arbitrary scaling of arrows so they're same mag as PC scores

ggplot(data = as.data.frame(pca_res$x %*% diag(1/pca_res$sdev)), 
       aes(x = V1, y = V2)) +
  geom_point() +
  geom_segment(data = as.data.frame(pca_res$rotation %*% diag(pca_res$sdev)),
               aes(x = 0, y = 0, yend = arrow_len*V1, xend = arrow_len*V2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(data = as.data.frame(pca_res$rotation %*% diag(pca_res$sdev)),
            mapping = aes(y = arrow_len*V1, x = arrow_len*V2,
                          label = row.names(pca_res$rotation)))

ggplot(data = as.data.frame(F_frame_scaling2), aes(x = x_flip*V1, y = y_flip*V2)) +
  geom_point() +
  geom_segment(data = as.data.frame(U_frame_scaling2),
               aes(x = 0, y = 0, yend = y_flip*arrow_len*V1, xend = x_flip*arrow_len*V2),
               arrow = arrow(length = unit(0.02, "npc"))) +
  geom_text(data = as.data.frame(U_frame_scaling2),
            mapping = aes(y = y_flip*arrow_len*V1, x = x_flip*arrow_len*V2,
                          label = colnames(Y_centered)))

As for the differences between the two (in case the pdf above becomes unavailable at some point):

Scaling type 1: distance biplot, used when the interest is on the positions of the objects with respect to one another. –

Plot matrices F to represent the objects and U for the variables.

Scaling type 2: correlation biplot, used when the angular relationships among the variables are of primary interest. –

Plot matrices G to represent the objects and Usc2 for the variables, where G = FΛ–1/2 , and Usc2 = UΛ1/2.

In scaling 1 (distance biplot),

the sites have variances, along each axis (or principal component), equal to the axis eigenvalue (column of F);
the eigenvectors (columns of U) are normed to lengths = 1;
the length (norm) of each species vector in the pdimensional ordination space (rows of U) is 1.

In scaling 2 (correlation biplot),

the sites have unit variance along each axis (columns of G);
the eigenvectors (columns of Usc2) are normed to lengths = sqrt(eigenvalues);
the norm of each species vector in the p-dimensional ordination space (rows of Usc2) is its standard deviation.

In scaling 1 (distance biplot),

Distances among objects approximate their Euclidean distances in full multidimensional space.
Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
Since descriptors have equal lengths of 1 in the full-dimensional space, the length of the projection of a descriptor in reduced space indicates how much it contributes to the formation of that space.
A scaling 1 biplot thus shows which variables contribute the most to the ordination in a few dimensions (see also section: Equilibrium contribution of variables).
The descriptor-axes are orthogonal (90°) to one another in multidimensional space. These right angles, projected in reduced space, do not reflect the variables’ correlations.

In scaling 2 (correlation biplot),

Distances among objects approximate their Mahalanobis distances in full multidimensional space.
Projecting an object at right angle on a descriptor approximates the position of the object along that descriptor.
Since descriptors have lengths sj in full-dimensional space, the length of the projection of a descriptor j in reduced space is an approximation of its standard deviation sj . Note: sj is 1 when the variables have been standardized.
The angles between descriptors in the biplot reflect their correlations.
When the distance relationships among objects are important for interpretation, this type of biplot is inadequate; a distance biplot should be used.

Upvotes: 2

How to plot distance biplot and correlation biplot results of SVD/PCA in R?

Answers (1)

Related Questions