Reputation: 45
Problem
I have a melted dataframe of 19 different tumor types, each with associated marker genes. I want to visualize the similarity between tumor types to see how they cluster. I have a plan to attack this problem, but it seems like there should be an easier way.
Dummy Data
>df <- data.frame(tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2", "tumor2", "tumor3", "tumor4", "tumor4"), genes = c("geneA", "geneB", "geneC", "geneA", "geneD", "geneD", "geneA", "geneD"))
>df
tumor_type genes
tumor1 geneA
tumor1 geneB
tumor1 geneC
tumor2 geneA
tumor2 geneD
tumor3 geneD
tumor4 geneA
tumor4 geneD
Proposed solution
1) Break melted dataframe into individual tumor lists
2) Calculate pairwise similarity scores between all combinations of tumors. I'll have to do some kind of for loop using length(intersect(tumor1, tumor2)) / (length(intersect(tumor1, tumor2)) + length(setdiff(tumor1, tumor2)) + length(setdiff(tumor2, tumor1))) * 100.
Should get a matrix like:
>dfmatrix
       tumor1 tumor2 tumor3 tumor4
tumor1    100     25      0     25
tumor2     25    100     50    100
tumor3      0     50    100     50
tumor4     25    100     50    100
3) I'll then do a standard heatmap
I will need help figuring out the individual components (like how to do the loop to do all pairwise comparisons), but I thought I should start at a higher level to make sure that my thinking about this process is correct before asking a bunch of different questions on the details.
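Steps 1 and 2 above could be sketched in base R roughly as follows. This is only a minimal sketch: `gene_sets`, `jaccard_pct` and `sim` are names made up for illustration, and nested `sapply` calls stand in for the explicit for loop.

```r
# Dummy data from the question
df <- data.frame(
  tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2",
                 "tumor2", "tumor3", "tumor4", "tumor4"),
  genes = c("geneA", "geneB", "geneC", "geneA",
            "geneD", "geneD", "geneA", "geneD")
)

# Step 1: one character vector of marker genes per tumor type
gene_sets <- split(as.character(df$genes), df$tumor_type)

# Hypothetical helper: shared genes as a percentage of all genes in either set
jaccard_pct <- function(a, b) {
  100 * length(intersect(a, b)) / length(union(a, b))
}

# Step 2: every pairwise comparison, returned as a named square matrix
sim <- sapply(gene_sets, function(a) {
  sapply(gene_sets, function(b) jaccard_pct(a, b))
})
sim
```

The denominator uses `union(a, b)`, which has the same size as the intersection plus the two set differences in the formula above. The resulting `sim` matrix can be passed directly to a heatmap function for step 3.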
Upvotes: 1
Views: 1625
Reputation: 46898
This is a very simplified solution, perhaps best suited to exploring the data. Reduce the problem to a binary question, whether or not each gene is associated with each tumour:
table(df$tumor_type,df$genes)
         geneA geneB geneC geneD
  tumor1     1     1     1     0
  tumor2     1     0     0     1
  tumor3     0     0     0     1
  tumor4     1     0     0     1
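Then we can compute pairwise distances with dist. With method="binary" this is the Jaccard distance, i.e. 1 minus the ratio of shared genes to genes present in either tumour:
D = dist(table(df$tumor_type,df$genes),method="binary")
D
       tumor1 tumor2 tumor3
tumor2   0.75
tumor3   1.00   0.50
tumor4   0.75   0.00   0.50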
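Or, if you prefer other similarity coefficients, you can use dist.binary from ade4, where the method argument selects the coefficient (method=1 is Jaccard). Note that dist.binary returns sqrt(1 - s) for a similarity s, so its values differ from those of dist above: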
library(ade4)
D = dist.binary(unclass(table(df$tumor_type,df$genes)),method=1)
Then visualize the similarity, i.e. 1 - distance:
library(pheatmap)
pheatmap(1-as.matrix(D))
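Since the goal is to see how the tumour types cluster, the same distance object can also be fed to hierarchical clustering directly. The following is a minimal sketch using base R only; average linkage and the cut into two groups are arbitrary choices for illustration, not something the question prescribes.

```r
# Dummy data from the question
df <- data.frame(
  tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2",
                 "tumor2", "tumor3", "tumor4", "tumor4"),
  genes = c("geneA", "geneB", "geneC", "geneA",
            "geneD", "geneD", "geneA", "geneD")
)

# Presence/absence matrix: tumour types in rows, genes in columns
tab <- unclass(table(df$tumor_type, df$genes))

# Jaccard distance between tumour types
D <- dist(tab, method = "binary")

# Average-linkage hierarchical clustering (linkage choice is an assumption)
hc <- hclust(D, method = "average")
plot(hc, main = "Tumour types clustered by shared marker genes")

# Cut the tree into two groups (k = 2 is arbitrary here)
groups <- cutree(hc, k = 2)
groups
```

With the dummy data, tumor2 and tumor4 have identical gene sets, so they sit at distance 0 and always end up in the same group.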
Upvotes: 2
Reputation: 8506
Assuming that the "19 different tumor types" can be represented as 19 samples, so that you can create an n_genes x 19 expression matrix, you could use dcast to generate the matrix, then generate pairwise similarity heatmaps (correlation or distance based).
You may have to think about ways to deal with missing data to get appropriate similarity scores.
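For gene lists like the dummy data in the question, one possible way to handle the missing tumour/gene combinations is to count occurrences while casting, so that absent pairs become 0. This is a sketch under that assumption; `pa` and `wide` are illustrative names, and treating a missing gene as absent (rather than unknown) is a modelling choice you would need to justify for real data.

```r
library(data.table)

# Dummy data from the question, as a data.table
df <- data.table(
  tumor_type = c("tumor1", "tumor1", "tumor1", "tumor2",
                 "tumor2", "tumor3", "tumor4", "tumor4"),
  genes = c("geneA", "geneB", "geneC", "geneA",
            "geneD", "geneD", "geneA", "geneD")
)

# Cast to wide format, counting rows per gene/tumour pair;
# combinations that never occur are filled with 0
wide <- dcast(df, genes ~ tumor_type,
              fun.aggregate = length, value.var = "genes", fill = 0L)

# genes x tumour presence/absence matrix, gene names as row names
pa <- as.matrix(wide, rownames = "genes")
pa

# Jaccard distance between tumour types (columns), hence the transpose
D <- dist(t(pa), method = "binary")
D
```

Because `method = "binary"` only distinguishes zero from non-zero, duplicated gene entries for a tumour would not change the distances.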
Assuming a complete matrix, you could just use the dist function, for example:
library(data.table)
library(pheatmap)
# mock data
set.seed(1)
mat <- matrix(
stats::runif(1000, 3, 14),
nrow = 100,
ncol = 10,
dimnames = list(paste0("gene", 1:100), paste0("Sample", 1:10))
)
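# shift 30 random genes upward in Samples 1-5 to give the data some structure
modmat <- base::sample(1:100, 30)
mat[modmat, 1:5] <- mat[modmat, 1:5] + stats::runif(150, 4, 6)
MAT <- melt(data.table(mat, keep.rownames = TRUE), id.vars = "rn")
# MAT would correspond to your melted data.frame, after setDT(your.df)
# cast back to a genes x samples matrix, using the "rn" column as row names
mat <- as.matrix(dcast(MAT, rn ~ variable, value.var = "value"), rownames = 1)
# Euclidean distances between samples (columns), as a full symmetric matrix
cmat <- as.matrix(dist(t(mat), diag=TRUE, upper=TRUE))
pheatmap(cmat)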
Created on 2020-04-09 by the reprex package (v0.3.0)
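If a correlation-based similarity is preferred over Euclidean distance, the same kind of matrix can be passed to cor instead of dist. The sketch below rebuilds the mock data with base R only and uses stats::heatmap rather than pheatmap to stay dependency-free. Pearson correlation is an assumption here, and Spearman may be more appropriate for some expression data.

```r
# Mock expression matrix, built the same way as in the example above
set.seed(1)
mat <- matrix(
  runif(1000, 3, 14),
  nrow = 100,
  ncol = 10,
  dimnames = list(paste0("gene", 1:100), paste0("Sample", 1:10))
)
modmat <- sample(1:100, 30)
mat[modmat, 1:5] <- mat[modmat, 1:5] + runif(150, 4, 6)

# Pairwise correlation between samples (cor works column-wise)
cmat <- cor(mat, method = "pearson")

# Symmetric heatmap of the correlation matrix, without rescaling the values
heatmap(cmat, symm = TRUE, scale = "none")
```

Samples 1 to 5 share the shifted genes, so they would be expected to correlate more strongly with each other than with Samples 6 to 10.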
Upvotes: 1