Reputation: 3
I am trying to align two gene sequences by codons based on a protein alignment using Bio.codonalign. Their example is given here (under 'build' function): https://biopython.org/DIST/docs/api/Bio.codonalign-module.html. I've tried out their example and it's worked.
Now, I want the sequences to be obtained from a FASTA file (ap_20 has aligned proteins and ug_20 has unaligned genes). The following is my code.
# Import packages
from Bio.Alphabet import generic_dna, generic_protein
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
from Bio.codonalign import build
# Define set of orthologous genes and proteins
genes = list(SeqIO.parse("ug_20.fasta", "fasta"))
proteins = list(SeqIO.parse("ap_20.fasta", "fasta"))
# Assign individual sequences to variables
seq1 = SeqRecord(Seq(str(genes[0].seq), alphabet=generic_dna), id="pro1")
seq2 = SeqRecord(Seq(str(genes[1].seq), alphabet=generic_dna), id="pro2")
pro1 = SeqRecord(Seq(str(proteins[0].seq), alphabet=generic_protein), id="pro1")
pro2 = SeqRecord(Seq(str(proteins[1].seq), alphabet=generic_protein), id="pro2")
# MultipleSeqAlignment reads the protein alignment
aln = MultipleSeqAlignment([pro1, pro2])
print(aln)
# Build codon alignment
codon_aln = build(aln, [seq1, seq2])
print(codon_aln)
The aln
works, but it is the last build
step that doesn't. I get the following error. I'm not sure what KeyError: 'm'
means, but I know that all my protein sequences start with the letter 'm'. I replaced part of the file path with '...' to keep it short.
Traceback (most recent call last):
File "/Users/.../tempCodeRunnerFile.py", line 30, in <module>
codon_aln = build(aln, [seq1, seq2])
File "/Users/.../anaconda3/lib/python3.6/site-packages/Bio/codonalign/__init__.py", line 168, in build
anchor_len=anchor_len)
File "/Users/.../anaconda3/lib/python3.6/site-packages/Bio/codonalign/__init__.py", line 261, in _check_corr
pro_re += aa2re[aa]
KeyError: 'm'
Upvotes: 0
Views: 693
Reputation: 22827
You don't provide (parts of) your input files ug_20.fasta
and ap_20.fasta
, making it harder for me to debug, but I can trigger a similar error with the following code:
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Align import MultipleSeqAlignment
>>> from Bio.codonalign import build
>>> seq1 = SeqRecord(Seq('ATGTCTCGT', alphabet=generic_dna), id='pro1')
>>> seq2 = SeqRecord(Seq('ATGCGT', alphabet=generic_dna), id='pro2')
>>> pro1 = SeqRecord(Seq('MSR', alphabet=generic_protein), id='pro1')
>>> pro2 = SeqRecord(Seq('m-R', alphabet=generic_protein), id='pro2')
>>> aln = MultipleSeqAlignment([pro1, pro2])
>>> codon_aln = build(aln, [seq1, seq2])
>>> print(codon_aln)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-25-da1b827fb67e> in <module>()
9 pro2 = SeqRecord(Seq('m-R', alphabet=generic_protein), id='pro2')
10 aln = MultipleSeqAlignment([pro1, pro2])
---> 11 codon_aln = build(aln, [seq1, seq2])
12 print(codon_aln)
1 frames
/usr/local/lib/python3.6/dist-packages/Bio/codonalign/__init__.py in _check_corr(pro, nucl, gap_char, codon_table, complete_protein, anchor_len)
259 for aa in pro.seq:
260 if aa != gap_char:
--> 261 pro_re += aa2re[aa]
262
263 nucl_seq = str(nucl.seq.upper().ungap(gap_char))
KeyError: 'm'
This is the default Bio.codealign.build example with only one change: in pro2
I changed 'M-R'
to 'm-R'
. So this suggest to me that one of your protein sequences contains lowercase characters whereas Bio.codealign.build()
seems to expect uppercase characters. You can convert your protein sequences to uppercase like this:
pro1 = SeqRecord(Seq(str(proteins[0].seq.upper()), alphabet=generic_protein), id="pro1")
pro2 = SeqRecord(Seq(str(proteins[1].seq.upper()), alphabet=generic_protein), id="pro2")
Upvotes: 1