Anna Banana
Anna Banana

Reputation: 3

KeyError: 'm' when building a codon alignment using Bio.codonalign

I am trying to align two gene sequences by codons based on a protein alignment using Bio.codonalign. Their example is given here (under 'build' function): https://biopython.org/DIST/docs/api/Bio.codonalign-module.html. I've tried out their example and it's worked.

Now, I want the sequences to be obtained from a FASTA file (ap_20 has aligned proteins and ug_20 has unaligned genes). The following is my code.

# Import packages
from Bio.Alphabet import generic_dna, generic_protein
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
from Bio.codonalign import build

# Define set of orthologous genes and proteins
genes = list(SeqIO.parse("ug_20.fasta", "fasta"))
proteins = list(SeqIO.parse("ap_20.fasta", "fasta"))

# Assign individual sequences to variables
seq1 = SeqRecord(Seq(str(genes[0].seq), alphabet=generic_dna), id="pro1")
seq2 = SeqRecord(Seq(str(genes[1].seq), alphabet=generic_dna), id="pro2")

pro1 = SeqRecord(Seq(str(proteins[0].seq), alphabet=generic_protein), id="pro1")
pro2 = SeqRecord(Seq(str(proteins[1].seq), alphabet=generic_protein), id="pro2")

# MultipleSeqAlignment reads the protein alignment
aln = MultipleSeqAlignment([pro1, pro2])
print(aln)

# Build codon alignment
codon_aln = build(aln, [seq1, seq2])
print(codon_aln)

The aln works, but it is the last build step that doesn't. I get the following error. I'm not sure what KeyError: 'm' means, but I know that all my protein sequences start with the letter 'm'. I replaced part of the file path with '...' to keep it short.

Traceback (most recent call last):
  File "/Users/.../tempCodeRunnerFile.py", line 30, in <module>
    codon_aln = build(aln, [seq1, seq2])
  File "/Users/.../anaconda3/lib/python3.6/site-packages/Bio/codonalign/__init__.py", line 168, in build
    anchor_len=anchor_len)
  File "/Users/.../anaconda3/lib/python3.6/site-packages/Bio/codonalign/__init__.py", line 261, in _check_corr
    pro_re += aa2re[aa]
KeyError: 'm'

Upvotes: 0

Views: 693

Answers (1)

BioGeek
BioGeek

Reputation: 22827

You don't provide (parts of) your input files ug_20.fasta and ap_20.fasta, making it harder for me to debug, but I can trigger a similar error with the following code:

>>> from Bio.Alphabet import generic_dna, generic_protein
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Align import MultipleSeqAlignment
>>> from Bio.codonalign import build
>>> seq1 = SeqRecord(Seq('ATGTCTCGT', alphabet=generic_dna), id='pro1')
>>> seq2 = SeqRecord(Seq('ATGCGT', alphabet=generic_dna), id='pro2')
>>> pro1 = SeqRecord(Seq('MSR', alphabet=generic_protein), id='pro1')
>>> pro2 = SeqRecord(Seq('m-R', alphabet=generic_protein), id='pro2')
>>> aln = MultipleSeqAlignment([pro1, pro2])
>>> codon_aln = build(aln, [seq1, seq2])
>>> print(codon_aln)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-25-da1b827fb67e> in <module>()
      9 pro2 = SeqRecord(Seq('m-R', alphabet=generic_protein), id='pro2')
     10 aln = MultipleSeqAlignment([pro1, pro2])
---> 11 codon_aln = build(aln, [seq1, seq2])
     12 print(codon_aln)

1 frames
/usr/local/lib/python3.6/dist-packages/Bio/codonalign/__init__.py in _check_corr(pro, nucl, gap_char, codon_table, complete_protein, anchor_len)
    259     for aa in pro.seq:
    260         if aa != gap_char:
--> 261             pro_re += aa2re[aa]
    262 
    263     nucl_seq = str(nucl.seq.upper().ungap(gap_char))

KeyError: 'm'

This is the default Bio.codealign.build example with only one change: in pro2 I changed 'M-R' to 'm-R'. So this suggest to me that one of your protein sequences contains lowercase characters whereas Bio.codealign.build() seems to expect uppercase characters. You can convert your protein sequences to uppercase like this:

pro1 = SeqRecord(Seq(str(proteins[0].seq.upper()), alphabet=generic_protein), id="pro1")
pro2 = SeqRecord(Seq(str(proteins[1].seq.upper()), alphabet=generic_protein), id="pro2")

Upvotes: 1

Related Questions