Reputation: 449
I have a table of genes and the diseases they are related to. I want to construct a phylogenetic tree and group the genes by their diseases. Below is a sample dataset, where each gene in the gene1 column is associated with disease1 and each gene in the gene2 column with disease2. Primarily, gene1 and gene2 are related to each other, and each is mapped to the diseases it belongs to.
gene1   gene2    disease1             disease2
AGTR1   ACHE     cancer               tumor
AGTR1   ACHE     parkinson's          asthma
ALOX5   ADRB1    myocardial infarct   heart failure
AR      ADORA1   breast cancer        anxiety disorder
I want a circular phylogenetic tree like the ones shown at http://itol.embl.de/itol.cgi
Any suggestions on how to do this in R or in other software?
Thanks
The code I am running now:
library(ape) # provides as.phylo() and the "fan" plot type
d <- read.csv("genes_disease.txt", sep="\t", header=TRUE)
phyl_gad <- as.phylo(hclust(dist(d)))
plot(phyl_gad, type="fan", edge.col=c("red","green","blue","orange","yellow","pink","magenta","white"), show.tip.label=FALSE)
If I set show.tip.label=TRUE, too many labels get plotted and the tips become cluttered.
My modified dataset now has only two columns, one for the gene and one for the disease.
Upvotes: 2
Views: 6146
Reputation: 3033
I think what you want is not a phylogeny but a clustering by distance. Here is a reproducible example.
library(XML)
library(RCurl)#geturl
library(rlist)
library(plyr)
library(reshape2)
library(ggtree)
#get the genes/ diseases info from internet
#example from http://www.musclegenetable.fr/
urllist<-paste0("http://195.83.227.65/4DACTION/GS/",LETTERS[1:24] )
theurl <- lapply(urllist, function(x) RCurl::getURL(x,.opts = list(ssl.verifypeer = T) ) )# wait
theurl2<-lapply(theurl, function(x) gsub("<span class='Style18'>","__",x))
tables <- lapply(theurl2, function (x) XML::readHTMLTable(x) )
tables2 <- lapply(tables, function(x) rlist::list.clean(x, fun = is.null, recursive = FALSE) )
unlist1 = lapply(tables2, plyr::ldply)
newdf<-do.call(rbind, unlist1)
colnames(newdf)[4]<-"diseases"
colnames(newdf)[2]<-"Gene"
newdf$gene<-sub("([A-z0-9]+)(__)(.*)","\\1",newdf$Gene)
newdf$diseases<-sub("(\\* )","",newdf$diseases, perl=T)
#split info of several diseases per gene, and simplify text
#to allow better clustering
newdf2<-as.data.frame(data.table::setDT(newdf)[, strsplit(as.character(diseases), "* ", fixed=TRUE), by = .(gene, diseases)
][,.(diseases = V1, gene)])
newdf2$disease<-sub("([A-z0-9,\\-\\(\\)\\/ ]+)( \\- )(.*)","\\1",newdf2$diseases)
newdf2$disease<-gsub("[0-9,]","",newdf2$disease)
newdf2$disease<-gsub("( [A-Z]{1,2})$","",newdf2$disease)
newdf2$disease<-gsub("(\\-)","",newdf2$disease)
newdf2$disease<-gsub("\\s*\\([^\\)]+\\)","",newdf2$disease)
newdf2$disease<-gsub("\\s*type.*","",newdf2$disease, ignore.case = T)
newdf2$disease<-gsub("(X{0,3})(IX|IV|V?I{0,3})","", newdf2$disease)
newdf2$disease<-gsub("( [A-z]{1,2})$","",newdf2$disease)
newdf2$disease<-sub("^([a-z])(.*)","\\U\\1\\E\\2",newdf2$disease, perl=T)
newdf2$disease<-trimws(newdf2$disease)
newdf2<-newdf2[,c(2,3)]
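#make clustering and tree
#cast to a gene x disease count table; genes become row names so that
#the non-numeric gene column does not enter the distance computation
newcasted <- reshape2::dcast(newdf2, gene ~ disease, fun.aggregate = length, value.var = "disease")
rownames(newcasted) <- newcasted$gene
phyl_gad <- ape::as.phylo(hclust(dist(newcasted[, -1, drop = FALSE])))
#use names of genes and diseases in tree
DT <- data.table::as.data.table(newdf2)
newdf4<-as.data.frame(DT[, lapply(.SD, paste, collapse=","), by = gene, .SDcols = 2])
newdf4$genemerge<-paste(newdf4$gene, newdf4$disease)
#match on gene name so that each label lands on the right tip
phyl_gad$tip.label<-newdf4$genemerge[match(phyl_gad$tip.label, newdf4$gene)]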
#plot tree
ggtree::ggtree(phyl_gad, layout = "circular")+ ggtree::geom_tiplab2(offset=0.1, align = F, size=4)
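Since the example above depends on an external website, here is a minimal offline sketch of the same idea, assuming a two-column gene/disease layout like the modified dataset mentioned in the question. The data frame pairs and the names incidence, gene_dist and gene_fit are made up for illustration, and the binary distance with average linkage is just one reasonable choice, not the only one. The values come from the sample table in the question.

```r
# Toy long-format data: one row per gene/disease pair. The values come from
# the sample table in the question; the two-column layout is an assumption.
pairs <- data.frame(
  gene = c("AGTR1", "AGTR1", "ACHE", "ACHE", "ALOX5", "ADRB1", "AR", "ADORA1"),
  disease = c("cancer", "parkinson's", "tumor", "asthma",
              "myocardial infarct", "heart failure",
              "breast cancer", "anxiety disorder"),
  stringsAsFactors = FALSE
)

# Gene x disease incidence matrix: 1 if the pair occurs, 0 otherwise.
incidence <- unclass(table(pairs$gene, pairs$disease))
incidence[incidence > 0] <- 1

# Binary (Jaccard-type) distance between genes, then hierarchical clustering.
gene_dist <- dist(incidence, method = "binary")
gene_fit <- hclust(gene_dist, method = "average")

# Draw the circular tree only if the ape package is installed.
if (requireNamespace("ape", quietly = TRUE)) {
  plot(ape::as.phylo(gene_fit), type = "fan", cex = 0.8)
}
```

In this tiny sample no two genes share a disease, so all distances are equal and the tree is not informative. With real data, genes that share diseases end up close together.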
Upvotes: 1
Reputation: 4376
Ah, I've done this before. As Bryan said, you want to use the ape package. Let's say that you have an hclust object. For example,
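library(ape)
#d must be a distance object, e.g. d<-dist(x) for a numeric matrix x
fit<-hclust(d,method='ward.D')#'ward' was renamed 'ward.D' in R 3.1.0
plot(as.phylo(fit),type='fan',label.offset=0.1,no.margin=TRUE)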
If you want to modify the colors of the tip labels, you can use cutree and the tip.color parameter. The colors are recycled across the clusters (e.g., color=c('red','blue') gives alternating red and blue text at the ends of the branches).
nclus=...#insert number of clusters you want to cut to
color=...#insert a vector of colors here
fit<-hclust(d,method='ward.D')
color_list=rep(color,length.out=nclus)#recycle the colors so every cluster gets one
clus=cutree(fit,nclus)
plot(as.phylo(fit),type='fan',tip.color=color_list[clus],label.offset=0.1,no.margin=TRUE)
I'm not sure what type of clustering method you want to use (I was using Ward's method), but that's how you do it.
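To make the cutree/tip.color idea concrete, here is a small sketch with placeholder values filled in. Everything specific in it is an assumption for illustration: the built-in USArrests data stands in for your gene/disease matrix, nclus is set to 4, and the palette has three colors. The palette is recycled with rep(..., length.out = nclus) so that it still covers every cluster when it is shorter than nclus. A smaller cex is one way to reduce label clutter.

```r
# Built-in data stands in for the real gene/disease matrix (an assumption).
d <- dist(scale(USArrests))
fit <- hclust(d, method = "ward.D")

nclus <- 4                              # hypothetical number of clusters
color <- c("red", "blue", "darkgreen")  # hypothetical palette, shorter than nclus

# Recycle the palette so that every cluster gets a color.
color_list <- rep(color, length.out = nclus)

# Cluster membership per observation, in the original row order, which is
# also the order of the tip labels after as.phylo().
clus <- cutree(fit, nclus)
tip_colors <- color_list[clus]

# Draw the circular tree only if the ape package is installed.
if (requireNamespace("ape", quietly = TRUE)) {
  plot(ape::as.phylo(fit), type = "fan", tip.color = tip_colors,
       label.offset = 0.1, cex = 0.6, no.margin = TRUE)
}
```

With three colors and four clusters, the first and fourth clusters share a color. Pass as many colors as clusters if you want each cluster to be distinct.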
Upvotes: 4