Reputation: 515
I am clustering some gene expression data using k-means in the reduced, PCA space and now I want to extract distinct features that best describe each cluster. These are features that are highly expressed within each cluster.
I've posted below a reproducible example to show my logic and where I've left off.
# Create test matrix
test = matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] = test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] = test[11:20, seq(2, 10, 2)] + 2
test[15:20, seq(2, 10, 2)] = test[15:20, seq(2, 10, 2)] + 4
colnames(test) = paste("Cell", 1:10, sep = "")
rownames(test) = paste("Gene", 1:20, sep = "")
# plot the inital heatmap
library(pheatmap)
pheatmap(t(test))
# preform PCA
pca = prcomp(t(test), center=TRUE, scale=TRUE)
rotation = data.frame(pca$x)
plot(rotation[1:3], pch=16, cex=0.6, cex.main=0.9)
# preform Kmeans in PCA space
wss = (nrow(rotation)-1)*sum(apply(rotation,2,var))
for (i in 2:9) wss[i] <- sum(kmeans(rotation, centers=i)$withinss)
plot(1:9, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
km = kmeans(rotation, 2)
cluster_assignment = as.factor(km$cluster)
# plot k-means cluster assignment in PCA space
library(ggplot2)
ggplot(rotation, aes(rotation$PC1, rotation$PC2, color=cluster_assignment, label=rownames(rotation))) + geom_point() + geom_text()
# create a cluster annotation data.frame
n_clusters = length(unique(km$cluster))
temp_cluster_design_list = list()
for(n in 1:n_clusters){
temp_cluster_design = data.frame(row.names=row.names(t(test)[cluster_assignment %in% n,]))
temp_cluster_design$cluster = n
temp_cluster_design_list[[n+1]] <- temp_cluster_design
}
cluster_design = do.call(rbind, temp_cluster_design_list)
# cluster_design looks something like this:
# cluster
# Cell1 1
# Cell3 1
# Cell5 1
# Cell7 1
# Cell9 1
# Cell2 2
# Cell4 2
# Cell6 2
# Cell8 2
# Cell10 2
# heatmap with cell cluster annotation
PCA_heatmap_data = t(test)[row.names(cluster_design),]
cluster_design$cluster = as.factor(cluster_design$cluster)
pheatmap(PCA_heatmap_data, annotation_row=cluster_design)
# extract out expressed genes from each cluster
for(n in 1:n_clusters) {
temp_cluster = t(test)[cluster_assignment %in% n,]
# ??? somehow extract out the expressed gene names specific to each cluster
}
The code above ultimately leaves me with a heatmap that looks something like this. Now, what I'd like to do with each cluster in the heatmap is to extract out gene names that are highly expressed. I ultimately would like to write out a table that looks something like this:
GENE CLUSTER
Gene20 cluster2
Gene19 cluster2
Gene15 cluster2
Gene18 cluster2
Gene16 cluster2
Gene17 cluster2
Gene9 cluster1
Gene8 cluster1
Gene4 cluster1
Gene3 cluster1
... ...
I'm not sure of the best and most efficient way to go about this. I would appreciate any help you could toss my way! Thank you!
EDIT
How do we define highly expressed? This I am not quite sure and hoping to get some insight. Perhaps comparing the average for each gene across all cells and comparing that to the average among the cluster? This may work but I think this is prone to be skewed by outliers. Another thought would be to scale each cluster and take genes that are significantly highly expressed?
How would I go about identifying highly expressed genes among clusters that are very similar? For example, this heatmap shows three cluster but the red and green clusters are very similar. It appears that Gene9 is cluster2 specific?
Upvotes: 1
Views: 2029
Reputation: 23200
threshold <- 1
nms <- as.character()
for(n in 1:n_clusters) {
for(i in 1:ncol(t(test)[row.names(t(test)) %in% names(cluster_assignment[cluster_assignment == n]),])){
if(mean(t(test)[row.names(t(test)) %in% names(cluster_assignment[cluster_assignment == n]),i]) > threshold){
nms <- c(nms, colnames(t(test)[row.names(t(test)) %in% names(cluster_assignment[cluster_assignment == n]),])[i])
}
}
if(n == 1) result <- data.frame(GENE = nms, CLUSTER = rep(n,length(nms))); rm(nms); nms <- as.character()
if(n > 1) result <- rbind(result, data.frame(GENE = nms, CLUSTER = rep(n,length(nms))))
}
result
GENE CLUSTER 1 Gene7 1 2 Gene11 1 3 Gene12 1 4 Gene13 1 5 Gene14 1 6 Gene15 1 7 Gene16 1 8 Gene17 1 9 Gene18 1 10 Gene19 1 11 Gene20 1 12 Gene1 2 13 Gene2 2 14 Gene3 2 15 Gene4 2 16 Gene5 2 17 Gene6 2 18 Gene7 2 19 Gene8 2 20 Gene9 2 21 Gene10 2
I left the threshold
as a parameter so that you can define it however you like. In this example I used 1 as the threshold.
Upvotes: 1