PCA analysis wrong output

Question

That's the example data:

structure(c(368113, 87747.35, 508620.5, 370570.5, 87286.5, 612728, 
55029, 358521, 2802880, 2045399.5, 177099, 317974.5, 320687.95, 
6971292.55, 78949, 245415.95, 50148.5, 67992.5, 97634, 56139.5, 
371719.2, 80182.7, 612078.5, 367822.5, 80691, 665190.65, 28283.5, 
309720, 2853241.5, 1584324, 135482.5, 270959, 343879.1, 6748208.5, 
71534.9, 258976, 28911.75, 78306, 56358.7, 46783.5, 320882.85, 
53098.3, 537383.5, 404505.5, 89759.7, 624120.55, 40406, 258183.5, 
3144610.45, 1735583.5, 122013.5, 249741, 362585.35, 5383869.15, 
23172.2, 223704.45, 40543.7, 68522.5, 43187.05, 29745, 356058.5, 
89287.25, 492242.5, 452135.5, 97253.55, 575661.95, 65739.5, 334703.5, 
3136065, 1622936.5, 131381.5, 254362, 311496.3, 5627561, 68210.6, 
264610.1, 45851, 65010.5, 32665.5, 39957.5, 362476.75, 59451.65, 
548279, 345096.5, 93363.5, 596444.2, 11052.5, 252812, 2934035, 
1732707.55, 208409.5, 208076.5, 437764.25, 16195882.45, 77461.25, 
205803.85, 30437.5, 75540, 49576.75, 48878, 340380.5, 43785.35, 
482713, 340315, 64308.5, 517859.85, 11297, 268993.5, 3069028.5, 
1571889, 157561, 217596.5, 400610.65, 5703337.6, 50640.65, 197477.75, 
40070, 66619, 81564.55, 41436.5, 367592.3, 64954.9, 530093, 432025, 
87212.5, 553901.65, 20803.5, 333940.5, 3027254.5, 1494468, 195221, 
222895.5, 494429.45, 7706885.75, 60633.35, 192827.1, 29857.5, 
81001.5, 112588.65, 68904.5, 338822.5, 56868.15, 467350, 314526.5, 
105568, 749456.1, 19597.5, 298939.5, 2993199.2, 1615231.5, 229185.5, 
280433.5, 360156.15, 5254889.1, 79369.5, 175434.05, 40907.05, 
70919, 65720.15, 53054.5), .Dim = c(20L, 8L), .Dimnames = list(
    c("Anne", "Greg", "thomas", "Chris", "Gerard", "Monk", "Mart", 
    "Mutr", "Aeqe", "Tor", "Gaer", "Toaq", "Kolr", "Wera", "Home", 
    "Terlo", "Kulte", "Mercia", "Loki", "Herta"), c("Day_Rep1", 
    "Day_Rep2", "Day_Rep3", "Day_Rep4", "Day2_Rep1", "Day2_Rep2", 
    "Day2_Rep3", "Day2_Rep4")))

I would like to perform a nice PCA analysis. I expect that replicates from Day will be nicely correlated with each other and replicates from Day2 together. I was trying to perform some analysis using the code below:

## log transform
data_log <- log(data[, 1:8])
#vec_EOD_EON
dt_PCA <- prcomp(data_log,
                           center = TRUE,
                           scale. = TRUE)

library(devtools)
install_github("ggbiplot", "vqv")

library(ggbiplot)
g <- ggbiplot(dt_PCA, obs.scale = 1, var.scale = 1, 
              groups = colnames(dt_PCA), ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = "")
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)

However, the output is not what I am looking for:

but I am looking for something more like that:

I would like to use dots for each row in the data and different colors for each of the replicate. Would be cool to use the similar colors for Day replicates and as well for Day2.

Obtained data with ggplot:

drmariod · Accepted Answer

Let's imagine you save your data into df.

library(ggplot2)
pc_df <- prcomp(t(df), scale.=TRUE)
pc_table <- as.data.frame(pc_df$x[,1:2]) # extracting 1st and 2nd component

experiment_regex <- '(^[^_]+)_Rep(\d+)' # extracting replicate and condition from your experiment names
pc_table$replicate <- as.factor(sub(experiment_regex,'\2', rownames(pc_table)))
pc_table$condition <- as.factor(sub(experiment_regex,'\1', rownames(pc_table)))

ggplot(pc_table, aes(PC1, PC2, color=condition, shape=replicate)) +
  geom_point() +
  xlab(sprintf('PC1 - %.1f%%', # extracting the percentage of each PC and print it on the axes
               summary(pc_df)$importance[2,1] * 100)) +
  ylab(sprintf('PC2 - %.1f%%', 
               summary(pc_df)$importance[2,2] * 100))

The first thing you have to do, to get your data in the correct shape is to transform it using t(). This might be already what you are looking for.

I prefer to do the plots with my own function and I wrote the steps down to get a nice plot with ggplot2.

UPDATE:

Since you were asking in the comments. Here is an example where an experiment was repeated on a different day. Replicate 1 and 2 on one day, and a few days later replicate 3 and 4. The difference on both days are higher then the changes in the conditions (day has 49% variance, experiment has only 20% variance explained). This is not a good experiment and should be repeated!

PCA analysis wrong output

Answers (1)

Related Questions