How to generate similarity scores heatmap from multiple gene lists in R

Question

Problem
I have a melted dataframe of 19 different tumor types, each with associated marker genes. I want to visualize the similarity between tumor types to see how they cluster. I have a plan to attack this problem, but it seems like there should be an easier way.

Dummy Data

>df <- data.frame(tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2", "tumor2", "tumor3", "tumor4", "tumor4"), genes = c("geneA", "geneB", "geneC", "geneA", "geneD", "geneD", "geneA", "geneD"))

>df
tumor_type  genes
tumor1      geneA
tumor1      geneB
tumor1      geneC
tumor2      geneA
tumor2      geneD
tumor3      geneD
tumor4      geneA
tumor4      geneD

Proposed solution
1) Break melted dataframe into individual tumor lists
2) Calculate pairwise similarity scores between all combinations of tumors. I'll have to do some kind of for loop using length(intersect(tumor1, tumor2)) / (length(intersect(tumor1, tumor2)) + length(setdiff(tumor1, tumor2)) + length(setdiff(tumor2, tumor1))) * 100, i.e. shared genes as a percentage of all genes found in either tumor.
Should get a matrix like:

>dfmatrix
       tumor1   tumor2   tumor3   tumor4
tumor1    100       25        0       25
tumor2     25      100       50      100
tumor3      0       50      100       50
tumor4     25      100       50      100

3) I'll then do a standard heatmap

I will need help figuring out the individual components (like how to do the loop to do all pairwise comparisons), but I thought I should start at a higher level to make sure that my thinking about this process is correct before asking a bunch of different questions on the details.

StupidWolf · Accepted Answer

This is a deliberately simple solution, probably best suited to exploring the data. Reduce the problem to whether each gene is associated with each tumour, coded as binary presence/absence:

table(df$tumor_type,df$genes)
         geneA geneB geneC geneD
  tumor1     1     1     1     0
  tumor2     1     0     0     1
  tumor3     0     0     0     1
  tumor4     1     0     0     1

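Then compute pairwise distances with dist. With method="binary", the distance is the proportion of genes found in only one of the two tumours among the genes found in at least one of them (the Jaccard distance):

D = dist(table(df$tumor_type,df$genes),method="binary")
D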
       tumor1 tumor2 tumor3
tumor2   0.75              
tumor3   1.00   0.50       
tumor4   0.75   0.00   0.50
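
To get from these distances to the percentage matrix sketched in the question, one option is to convert the dist object to a full matrix and rescale. This is a minimal sketch on the dummy data; the names tab and sim_pct are only illustrative.

```r
# Dummy data from the question
df <- data.frame(
  tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2",
                 "tumor2", "tumor3", "tumor4", "tumor4"),
  genes = c("geneA", "geneB", "geneC", "geneA",
            "geneD", "geneD", "geneA", "geneD")
)

# Presence/absence table: rows are tumor types, columns are genes
tab <- table(df$tumor_type, df$genes)

# Binary (Jaccard) distance between rows
D <- dist(tab, method = "binary")

# Full symmetric matrix, converted from distance to percent similarity
sim_pct <- 100 * (1 - as.matrix(D))
print(sim_pct)
```

The diagonal comes out as 100 because as.matrix fills the self-distances with 0.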

If you prefer other similarity measures, dist.binary from ade4 offers several (method=1 is the Jaccard index):

library(ade4)
D = dist.binary(unclass(table(df$tumor_type,df$genes)),method=1)

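Then visualize 1 - distance. For the dist(..., method="binary") result this is the Jaccard similarity. Note that ade4 documents its distances as d = sqrt(1 - s), so if D comes from dist.binary, the similarity is 1 - as.matrix(D)^2 rather than 1 - as.matrix(D).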

library(pheatmap)
pheatmap(1-as.matrix(D))
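
If you would rather follow the steps proposed in the question literally (split into gene lists, then compare every pair), the following is a rough sketch of that route. The helper jaccard_pct and the variable names are made up for illustration, and nested sapply calls stand in for the explicit for loop.

```r
# Dummy data from the question
df <- data.frame(
  tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2",
                 "tumor2", "tumor3", "tumor4", "tumor4"),
  genes = c("geneA", "geneB", "geneC", "geneA",
            "geneD", "geneD", "geneA", "geneD")
)

# Step 1: one character vector of genes per tumor type
gene_sets <- split(as.character(df$genes), df$tumor_type)

# Hypothetical helper: shared genes as a percentage of genes in either set
jaccard_pct <- function(a, b) {
  100 * length(intersect(a, b)) / length(union(a, b))
}

# Step 2: all pairwise comparisons; the result is a square matrix
# with tumor types as both row and column names
sim_pct <- sapply(gene_sets, function(a) {
  sapply(gene_sets, function(b) jaccard_pct(a, b))
})
print(sim_pct)
```

The resulting matrix can be passed to pheatmap(sim_pct) for step 3. On the dummy data it should agree with 100 * (1 - as.matrix(D)) from the dist approach, so the set-based version is mainly useful if you want to swap in a different similarity formula.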
