Reputation: 43
I need a methodology for creating a consensus sequence out of 3 - 1000 short (10-20bp) nucleotide ("ATCG") reads of varying lengths.
A simplified example:
"AGGGGC"
"AGGGC"
"AGGGGGC"
"AGGAGC"
"AGGGGG"
Should result in a consensus sequence of "AGGGGC"
.
I have found modules that do multiple sequence alignment (MSA) in the BioPython library, but only for sequences of the same length. I am also familiar with (and have implemented) Smith-Waterman style alignment for two sequences of any length. I imagine there must be a library or implementation that combine these elements (MSA over unequal lentghs), but after hours of scouring the web and various documentation haven't found anything.
Any advice on existing modules/libraries (Python preferred) or programs I can incorporate into a pipeline that do this?
Thanks!
Upvotes: 4
Views: 5288
Reputation: 22827
Add gap characters at the end of your sequences so that they all have the same length. Then you can process them with your program of choice, e.g. MUSCLE:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align.Applications import MuscleCommandline
sequences = ["AGGGGC",
"AGGGC",
"AGGGGGC",
"AGGAGC",
"AGGGGG"]
longest_length = max(len(s) for s in sequences)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences]
records = (SeqRecord(Seq(s)) for s in padded_sequences)
SeqIO.write(records, "msa_example.fasta", "fasta")
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="msa_example.fasta", out="msa.txt")
print cline
Upvotes: 1