Reputation: 2649
Given a random sequence, how can I check if that sequence is protein or not?
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_prot = Seq("TGEKPYVCQECGKAFNCSSYLSKHQR")
my_prot
my_prot.alphabet #How to make a check here ??
Upvotes: 2
Views: 3525
Reputation: 3041
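Biopython has since removed Bio.Alphabet (as of release 1.78), so the alphabet attribute is no longer available.
Adapting the approach from https://www.biostars.org/p/102/,
you can validate the sequence against a regular expression instead: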
import re
from Bio.Seq import Seq
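def validate(seq, alphabet='dna'):
    # Case-insensitive patterns that match only strings built from the allowed letters.
    alphabets = {'dna': re.compile('^[acgtn]*$', re.I),
                 'protein': re.compile('^[acdefghiklmnpqrstvwy]*$', re.I)}
    # True when every character of seq belongs to the chosen alphabet.
    return alphabets[alphabet].search(seq) is not None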
dataz = 'AAAAAAACCCCCCCCCCCCCCDDDDDDRRRRRRRREERRRRGGG'
pippo = Seq(dataz)
print(pippo, type(pippo))
print(validate(str(pippo), 'dna'))
print(validate(str(pippo), 'protein'))
dataz = 'atg'
pippo = Seq(dataz)
print(pippo, type(pippo))
print(validate(str(pippo), 'dna'))
print(validate(str(pippo), 'protein'))
Output:
AAAAAAACCCCCCCCCCCCCCDDDDDDRRRRRRRREERRRRGGG <class 'Bio.Seq.Seq'>
False
True
atg <class 'Bio.Seq.Seq'>
True
True
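A True/False result does not say which letters caused the failure. As a minimal sketch, assuming plain Python sets instead of regular expressions, the helper below returns the offending letters. The names ALPHABETS and invalid_letters are hypothetical and not part of Biopython; the letter sets are the standard IUPAC codes. Unlike the regex above, the "dna" set here does not accept N; use "dna_ambiguous" for that.

```python
# Hypothetical helper, not a Biopython API: letter sets follow the IUPAC codes.
ALPHABETS = {
    "dna": frozenset("ACGT"),
    "dna_ambiguous": frozenset("ACGTRYWSMKHBVDN"),
    "rna": frozenset("ACGU"),
    "protein": frozenset("ACDEFGHIKLMNPQRSTVWY"),
}


def invalid_letters(seq, alphabet="dna"):
    """Return the sorted letters of seq that fall outside the chosen alphabet."""
    # str() lets this accept a Bio.Seq.Seq object as well as a plain string.
    return sorted(set(str(seq).upper()) - ALPHABETS[alphabet])


data = "AAAACCCCDDDDRRRREERRGGG"
print(invalid_letters(data, "dna"))            # ['D', 'E', 'R']
print(invalid_letters(data, "dna_ambiguous"))  # ['E']
print(invalid_letters(data, "protein"))        # []
```

An empty list means the sequence is valid for that alphabet, so `not invalid_letters(seq, 'protein')` gives the same kind of answer as validate.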
Upvotes: 1
Reputation: 17333
If your Seq object has an assigned alphabet (which requires a Biopython version that still ships Bio.Alphabet), you can check whether that alphabet is a protein alphabet:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC, ProteinAlphabet
my_prot = Seq("TGEKPYVCQECGKAFNCSSYLSKHQR", alphabet=IUPAC.IUPACProtein())
print(isinstance(my_prot.alphabet, ProteinAlphabet))
However, if you don't know the alphabet, you'll have to use heuristics to guess whether or not it's a protein sequence. This could be as simple as checking whether the sequence consists entirely of the letters A, C, G, and T (or U for RNA), or whether it uses other letter codes.
But this isn't perfect. For instance, the sequence "ATCG" could be alanine, threonine, cysteine, glycine (i.e. a protein), or it could be adenine, thymine, cytosine, guanine (DNA). Similarly, "ACG" could be a protein, RNA, or DNA. From the letters alone, it's impossible to be sure that a sequence is DNA and not a protein sequence. However, if you have a SeqRecord or other context for the Seq, you may be able to check whether it's a protein sequence.
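To make the ambiguity concrete, here is a minimal sketch that reports every sequence type a string is compatible with, instead of forcing a single answer. The function name compatible_types and the three letter sets are illustrative assumptions, not Biopython features; only the 20 standard amino-acid codes and the unambiguous nucleotide letters are considered.

```python
# Illustrative letter sets (assumption: no ambiguity codes, no gaps or stops).
CANDIDATES = {
    "dna": frozenset("ACGT"),
    "rna": frozenset("ACGU"),
    "protein": frozenset("ACDEFGHIKLMNPQRSTVWY"),
}


def compatible_types(seq):
    """Return the set of type names whose alphabet covers every letter in seq."""
    letters = set(str(seq).upper())
    if not letters:
        # An empty sequence carries no evidence for any type.
        return set()
    return {name for name, allowed in CANDIDATES.items() if letters <= allowed}


print(sorted(compatible_types("ATCG")))  # ['dna', 'protein']
print(sorted(compatible_types("ACG")))   # ['dna', 'protein', 'rna']
print(sorted(compatible_types("TGEKPYVCQECGKAFNCSSYLSKHQR")))  # ['protein']
```

A result with exactly one entry is the only case where the letters alone settle the question. When several types come back, the decision has to rest on outside context, such as where the sequence came from.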
Upvotes: 4