wissem
wissem

Reputation: 58

R PCA makes graph that is fishy, can't ID why

Link to data as txt file here

I'm having trouble with this PCA. PC1 results appear binary, and I can't figure out why as none of my variables are binary.

df = bees

pca_dat_condition <- bees %>% ungroup() %>%
  select(Length.1:Length.25, OBJECTID, Local, Elevation, Longitude, 
  Latitude, Cubital.Index)   %>% 
  na.omit()

pca_dat_first <- pca_dat_condition %>%      #remove the final nonnumerical information 
  select(-Local, -OBJECTID, -Elevation, -Longitude, -Latitude) 

pca <- pca_dat_first%>%   
  scale()  %>%
  prcomp()

# add identifying information back into PCA data
pca_data <- data.frame(pca$x, Local=pca_dat_condition$Local, ID = 
pca_dat_condition$OBJECTID, elevation = pca_dat_condition$Elevation, 
    Longitude = pca_dat_condition$Longitude, Latitude = 
    pca_dat_condition$Latitude)
ggplot(pca_data, aes(x=PC1, y=PC2, color = Latitude)) + 
   geom_point() +ggtitle("PC1 vs PC2: All Individuals") +
   scale_colour_gradient(low = "blue", high = "red")

I'm not getting any error messages with the code, and when I look at the data frame nothing looks out of place. Should I be using a different function for the PCA? Any insight into why my graph may look like this?

Previously, I did the same PCA but for the average values for each Local (whereas this is each individual), and it came out as a normal PCA with no clear clustering. I don't understand why this problem would arise when looking at individual points. It's possible I merged some other data frames in a wonky way, but the structure of the dataset seems completely normal.

This is how the PCA looks.

Upvotes: 1

Views: 86

Answers (1)

AkselA
AkselA

Reputation: 8837

bees <- read.csv(paste0("https://gist.githubusercontent.com/AkselA/", 
                    "08a4e78a6a29a918ed597e9a32adc228/raw/", 
                    "6d0005fad4cb91830bcf7087176283b18683e9cd/bees.csv"), 
                    header=TRUE)

# bees <- bees[bees[,1] < 10,]  # This will remove the three offending rows
bees <- na.omit(bees)

bees.cond <- bees[, grep("Length|OBJ|Loc|Ele|Lon|Lat|Cubi", colnames(bees))]

bees.first <- bees[, grep("Length|Cubi", colnames(bees))]
summary(bees.first)
par(mfrow=c(7, 4), mar=rep(1, 4))
q <- lapply(1:ncol(bees.first), function(x) {
    h <- hist(scale(bees.first[, x]), plot=FALSE)
    h$counts <- log1p(h$counts)
    plot(h, main="", axes=FALSE, ann=FALSE)
    legend("topright", legend=names(bees.first[x]), 
      bty="n", cex=0.8, adj=c(0, -2), xpd=NA)
    })

bees.pca <- prcomp(bees.first, scale.=TRUE)
biplot(bees.pca)

Before removing the outliers

enter image description here

enter image description here

After

enter image description here

enter image description here

Upvotes: 2

Related Questions