Alexander West
Alexander West

Reputation: 43

Multiple Sequence Alignment with Unequal String Length

I need a methodology for creating a consensus sequence out of 3 - 1000 short (10-20bp) nucleotide ("ATCG") reads of varying lengths.

A simplified example:

"AGGGGC"
"AGGGC"
"AGGGGGC"
"AGGAGC"
"AGGGGG"

Should result in a consensus sequence of "AGGGGC".

I have found modules that do multiple sequence alignment (MSA) in the BioPython library, but only for sequences of the same length. I am also familiar with (and have implemented) Smith-Waterman style alignment for two sequences of any length. I imagine there must be a library or implementation that combine these elements (MSA over unequal lentghs), but after hours of scouring the web and various documentation haven't found anything.

Any advice on existing modules/libraries (Python preferred) or programs I can incorporate into a pipeline that do this?

Thanks!

Upvotes: 4

Views: 5288

Answers (1)

BioGeek
BioGeek

Reputation: 22827

Add gap characters at the end of your sequences so that they all have the same length. Then you can process them with your program of choice, e.g. MUSCLE:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align.Applications import MuscleCommandline

sequences = ["AGGGGC",
             "AGGGC",
             "AGGGGGC",
             "AGGAGC",
             "AGGGGG"]

longest_length = max(len(s) for s in sequences)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences]
records = (SeqRecord(Seq(s)) for s in padded_sequences)

SeqIO.write(records, "msa_example.fasta", "fasta")

from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="msa_example.fasta", out="msa.txt")
print cline

Upvotes: 1

Related Questions