rodrigarco
rodrigarco

Reputation: 123

How can I plot consensus sequences as a binary heatmap in R

I have many amino acids sequences like in a fasta format that I did multiple sequence alignment. I was trying to plot something like binary code as heatmap. If it had a change it would be red, if it did not change would be yellow, for example.

I came across to msaplot from ggtreepackage. I also checked ggmsa package for that. But so far, I did not get what I wanted. So basically I wanted to:

  1. change the multiple sequence alignment to a binary matrix (if the amino differs from the reference sequence, plot x, if not y)

  2. plot as a heatmap

A multiple sequence alignment is something like this for those who don't know.enter image description here

I know that I should provide some type of data example but I am not sure how to create an example of multiple sequence alignment

if you install ggmsa you can have an example of the data and plot in r using:

protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
ggmsa(protein_sequences, start = 265, end = 300) 

Upvotes: 1

Views: 1373

Answers (1)

StupidWolf
StupidWolf

Reputation: 46898

We read in the alignment:

library(Biostrings)
library(ggmsa)
protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
aln = readAAMultipleAlignment(protein_sequences)
ggmsa(protein_sequences, start = 265, end = 300) 

enter image description here

Set the reference as the 1st sequence, some Rattus, you can also use the consensus with consensusString() :

aln = unmasked(aln)
names(aln)[1]
[1] "PH4H_Rattus_norvegicus"

ref = aln[1]

Here we iterate through the sequence and make the binary for where the sequences are the same as the reference:

bm = sapply(1:length(aln),function(i){
as.numeric(as.matrix(aln[i])==as.matrix(ref))
})

bm = t(bm)
rownames(bm) = names(aln)

The plot you see above has sequences reversed, so to visualize the same thing we reverse it, and also subset on 265 - 300:

library(pheatmap)
pheatmap(bm[nrow(bm):1,265:300],cluster_rows=FALSE,cluster_cols=FALSE)

enter image description here

The last row, is Rattus, the reference, and anything similar to that is read, as you can see in the alignment above, last 4 sequences are identical.

Upvotes: 2

Related Questions