strugglebus

Reputation: 45

How to generate similarity scores heatmap from multiple gene lists in R

Problem
I have a melted dataframe of 19 different tumor types, each with associated marker genes. I want to visualize the similarity between tumor types to see how they cluster. I have a plan to attack this problem, but it seems like there should be an easier way.

Dummy Data

>df <- data.frame(tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2", "tumor2", "tumor3", "tumor4", "tumor4"), genes = c("geneA", "geneB", "geneC", "geneA", "geneD", "geneD", "geneA", "geneD"))

>df
tumor_type  genes
tumor1      geneA
tumor1      geneB
tumor1      geneC
tumor2      geneA
tumor2      geneD
tumor3      geneD
tumor4      geneA
tumor4      geneD

Proposed solution
1) Break melted dataframe into individual tumor lists
2) Calculate pairwise similarity scores between all combinations of tumors. I'll have to do some kind of for loop using length(intersect(tumor1, tumor2)) / (length(intersect(tumor1, tumor2)) + length(setdiff(tumor1, tumor2)) + length(setdiff(tumor2, tumor1))) * 100, i.e. the Jaccard index expressed as a percentage.
I should get a matrix like:

>dfmatrix
       tumor1   tumor2   tumor3   tumor4
tumor1    100       25        0       25
tumor2     25      100       50      100
tumor3      0       50      100       50
tumor4     25      100       50      100

3) I'll then do a standard heatmap
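
A minimal sketch of the three steps above, using only base R. The helper name jaccard_pct and the object names gene_sets and sim are made up for illustration; outer() with Vectorize() stands in for the explicit for loop, and base heatmap() is just one possible choice for the final plot.

```r
# Dummy data from the question
df <- data.frame(
  tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2",
                 "tumor2", "tumor3", "tumor4", "tumor4"),
  genes = c("geneA", "geneB", "geneC", "geneA",
            "geneD", "geneD", "geneA", "geneD")
)

# Step 1: one character vector of genes per tumor type
gene_sets <- split(as.character(df$genes), df$tumor_type)

# Hypothetical helper: Jaccard index as a percentage
# (shared genes divided by all genes seen in either set)
jaccard_pct <- function(a, b) {
  100 * length(intersect(a, b)) / length(union(a, b))
}

# Step 2: all pairwise comparisons without writing the loop by hand
tumors <- names(gene_sets)
sim <- outer(
  tumors, tumors,
  Vectorize(function(i, j) jaccard_pct(gene_sets[[i]], gene_sets[[j]]))
)
dimnames(sim) <- list(tumors, tumors)

# Step 3: heatmap of the similarity matrix (no row/column rescaling)
heatmap(sim, symm = TRUE, scale = "none")
```

Tumor types with identical gene sets score 100 against each other, so they end up in the same cluster.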

I will need help figuring out the individual components (like how to do the loop to do all pairwise comparisons), but I thought I should start at a higher level to make sure that my thinking about this process is correct before asking a bunch of different questions on the details.

Upvotes: 1

Views: 1625

Answers (2)

StupidWolf

Reputation: 46898

This is a very simplified solution, maybe just for exploring the data. You simplify the problem to asking which gene is associated with each tumour, in a binary manner:

table(df$tumor_type,df$genes)
         geneA geneB geneC geneD
  tumor1     1     1     1     0
  tumor2     1     0     0     1
  tumor3     0     0     0     1
  tumor4     1     0     0     1

Then we can compute pairwise distances with dist. With method="binary" the distance is 1 minus the Jaccard index:

D = dist(table(df$tumor_type,df$genes),method="binary")
D
       tumor1 tumor2 tumor3
tumor2   0.75              
tumor3   1.00   0.50       
tumor4   0.75   0.00   0.50

Or if you prefer other measurements, you can do:

library(ade4)
D = dist.binary(unclass(table(df$tumor_type,df$genes)),method=1)

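Then just visualize 1 - distance. For dist(..., method="binary") this is the Jaccard similarity; ade4's dist.binary instead returns sqrt(1 - s) for a similarity coefficient s, so in that case 1 - as.matrix(D)^2 recovers s.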

library(pheatmap)
pheatmap(1-as.matrix(D))
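
As a rough check that this approach matches the percentage matrix described in the question, here is a small sketch using only base R. The names pa and sim_pct are illustrative, and the data frame is the dummy data from the question.

```r
# Dummy data from the question
df <- data.frame(
  tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2",
                 "tumor2", "tumor3", "tumor4", "tumor4"),
  genes = c("geneA", "geneB", "geneC", "geneA",
            "geneD", "geneD", "geneA", "geneD")
)

# Presence/absence matrix: rows are tumor types, columns are genes
pa <- unclass(table(df$tumor_type, df$genes))

# stats::dist with method = "binary" treats non-zero entries as "on"
# and returns 1 - Jaccard index for each pair of rows
D <- dist(pa, method = "binary")

# Convert distances back to similarities, scaled to percent
sim_pct <- 100 * (1 - as.matrix(D))
print(sim_pct)
```

The resulting sim_pct can be passed to pheatmap() in place of 1-as.matrix(D) if percentages are preferred on the colour scale.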


Upvotes: 2

user12728748

Reputation: 8506

Assuming that the "19 different tumor types" can be represented as 19 samples, so that you can create an n_genes x 19 expression matrix, you could use dcast to generate the matrix, then generate pairwise distance (or correlation) heatmaps.

You may have to think about ways to deal with missing data to get appropriate similarity scores.

Assuming a complete matrix, you could just use the dist function, for example:

library(data.table)
library(pheatmap)

# mock data
set.seed(1)
mat <- matrix(
    stats::runif(1000, 3, 14),
    nrow = 100,
    ncol = 10,
    dimnames = list(paste0("gene", 1:100), paste0("Sample", 1:10))
)
modmat <- base::sample(1:100, 30)
mat[modmat, 1:5] <- mat[modmat, 1:5] + stats::runif(150, 4, 6)
MAT <- melt(data.table(mat, keep.rownames = TRUE), id.vars = "rn")
# MAT would correspond to your melted data.frame, after setDT(your.df)

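# reshape back to a genes x samples matrix (value.var made explicit to avoid the guess message)
mat <- as.matrix(dcast(MAT, rn ~ variable, value.var = "value"), rownames = 1)
# Euclidean distances between samples (columns), as a full symmetric matrix
cmat <- as.matrix(dist(t(mat), diag=TRUE, upper=TRUE))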

pheatmap(cmat)
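
If correlation is preferred over Euclidean distance, or if some values are missing, one possible variation is sketched below. It uses only base R: the mock matrix mirrors the one above, the 20 knocked-out values are an assumption made purely to imitate missing data, and base heatmap() is used instead of pheatmap so that no extra package is needed.

```r
# mock data, same shape as above
set.seed(1)
mat <- matrix(
    stats::runif(1000, 3, 14),
    nrow = 100,
    ncol = 10,
    dimnames = list(paste0("gene", 1:100), paste0("Sample", 1:10))
)

# Assumption for illustration: blank out 20 random entries as "missing"
mat[base::sample(length(mat), 20)] <- NA

# Pearson correlation between samples (columns); each pair uses only
# the genes that were measured in both samples
cmat <- stats::cor(mat, use = "pairwise.complete.obs")

# Heatmap of the sample-by-sample correlation matrix
heatmap(cmat, symm = TRUE, scale = "none")
```

Whether pairwise-complete correlation is appropriate depends on how much data is missing and why, so treat this as a starting point rather than a recommendation.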

Created on 2020-04-09 by the reprex package (v0.3.0)

Upvotes: 1
