user2117258
user2117258

Reputation: 515

Extracting k-means cluster-specific features

I am clustering some gene expression data using k-means in the reduced, PCA space and now I want to extract distinct features that best describe each cluster. These are features that are highly expressed within each cluster.

I've posted below a reproducible example to show my logic and where I've left off.

# Create test matrix
test = matrix(rnorm(200), 20, 10)
test[1:10, seq(1, 10, 2)] = test[1:10, seq(1, 10, 2)] + 3
test[11:20, seq(2, 10, 2)] = test[11:20, seq(2, 10, 2)] + 2
test[15:20, seq(2, 10, 2)] = test[15:20, seq(2, 10, 2)] + 4
colnames(test) = paste("Cell", 1:10, sep = "")
rownames(test) = paste("Gene", 1:20, sep = "")

# plot the inital heatmap
library(pheatmap)
pheatmap(t(test))

# preform PCA
pca = prcomp(t(test), center=TRUE, scale=TRUE)
rotation = data.frame(pca$x)
plot(rotation[1:3], pch=16, cex=0.6, cex.main=0.9)

# preform Kmeans in PCA space
wss = (nrow(rotation)-1)*sum(apply(rotation,2,var))
for (i in 2:9) wss[i] <- sum(kmeans(rotation, centers=i)$withinss)
plot(1:9, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
km = kmeans(rotation, 2)
cluster_assignment = as.factor(km$cluster)

# plot k-means cluster assignment in PCA space
library(ggplot2)
ggplot(rotation, aes(rotation$PC1, rotation$PC2, color=cluster_assignment, label=rownames(rotation))) + geom_point() + geom_text()

# create a cluster annotation data.frame
n_clusters = length(unique(km$cluster))   
temp_cluster_design_list = list()
for(n in 1:n_clusters){
  temp_cluster_design = data.frame(row.names=row.names(t(test)[cluster_assignment %in% n,]))
  temp_cluster_design$cluster = n 
  temp_cluster_design_list[[n+1]] <- temp_cluster_design
}
cluster_design = do.call(rbind, temp_cluster_design_list)

# cluster_design looks something like this: 
#        cluster
# Cell1        1
# Cell3        1
# Cell5        1
# Cell7        1
# Cell9        1
# Cell2        2
# Cell4        2
# Cell6        2
# Cell8        2
# Cell10       2

# heatmap with cell cluster annotation
PCA_heatmap_data = t(test)[row.names(cluster_design),]
cluster_design$cluster = as.factor(cluster_design$cluster)
pheatmap(PCA_heatmap_data, annotation_row=cluster_design)


# extract out expressed genes from each cluster
for(n in 1:n_clusters) {
    temp_cluster = t(test)[cluster_assignment %in% n,]
#     ??? somehow extract out the expressed gene names specific to each cluster 
}

The code above ultimately leaves me with a heatmap that looks something like this. Now, what I'd like to do with each cluster in the heatmap is to extract out gene names that are highly expressed. I ultimately would like to write out a table that looks something like this:

GENE      CLUSTER
Gene20    cluster2
Gene19    cluster2
Gene15    cluster2
Gene18    cluster2
Gene16    cluster2
Gene17    cluster2
Gene9     cluster1
Gene8     cluster1
Gene4     cluster1
Gene3     cluster1
...       ...

I'm not sure of the best and most efficient way to go about this. I would appreciate any help you could toss my way! Thank you!

EDIT

How do we define highly expressed? This I am not quite sure and hoping to get some insight. Perhaps comparing the average for each gene across all cells and comparing that to the average among the cluster? This may work but I think this is prone to be skewed by outliers. Another thought would be to scale each cluster and take genes that are significantly highly expressed?

How would I go about identifying highly expressed genes among clusters that are very similar? For example, this heatmap shows three cluster but the red and green clusters are very similar. It appears that Gene9 is cluster2 specific?

Upvotes: 1

Views: 2029

Answers (1)

Hack-R
Hack-R

Reputation: 23200

threshold <- 1
nms       <- as.character()
for(n in 1:n_clusters) {

  for(i in 1:ncol(t(test)[row.names(t(test)) %in% names(cluster_assignment[cluster_assignment == n]),])){
     if(mean(t(test)[row.names(t(test)) %in% names(cluster_assignment[cluster_assignment == n]),i]) > threshold){
             nms <- c(nms, colnames(t(test)[row.names(t(test)) %in% names(cluster_assignment[cluster_assignment == n]),])[i])
     }
  }
  if(n == 1) result <- data.frame(GENE = nms, CLUSTER = rep(n,length(nms))); rm(nms); nms <- as.character()
  if(n >  1) result <- rbind(result, data.frame(GENE = nms, CLUSTER = rep(n,length(nms))))
}

result
     GENE CLUSTER
1   Gene7       1
2  Gene11       1
3  Gene12       1
4  Gene13       1
5  Gene14       1
6  Gene15       1
7  Gene16       1
8  Gene17       1
9  Gene18       1
10 Gene19       1
11 Gene20       1
12  Gene1       2
13  Gene2       2
14  Gene3       2
15  Gene4       2
16  Gene5       2
17  Gene6       2
18  Gene7       2
19  Gene8       2
20  Gene9       2
21 Gene10       2

I left the threshold as a parameter so that you can define it however you like. In this example I used 1 as the threshold.

Upvotes: 1

Related Questions