Reputation: 1098
I have two strings in a file like this:
>1
atggca---------gtgtggcaatcggcacat
>2
atggca---------gtgtggcaatcggcacat
Using the AlignIO function in Biopython:
from Bio import AlignIO
print AlignIO.read("neighbor.fas", "fasta")
returns this:
SingleLetterAlphabet() alignment with 2 rows and 33 columns
atggca---------gtgtggcaatcggcacat 1
atggca---------gtgtggcaatcggcacat 2
I want to calculate the percentage identity between the two rows in this alignment.
row = align[:,n]
allows for the extraction of individual columns that can be compared.
Columns that contain only "-" should not be counted.
Upvotes: 2
Views: 9794
Reputation: 59
If you wanted to extend it to more than two sequences the following works well, it gives the same result as the BioPerl overall_percentage_identity function (http://search.cpan.org/dist/BioPerl/Bio/SimpleAlign.pm).
from Bio import AlignIO
align = AlignIO.read("neighbor.fas", "fasta")
print perc_identity(align)
def perc_identity(aln):
i = 0
for a in range(0,len(aln[0])):
s = aln[:,a]
if s == len(s) * s[0]:
i += 1
return 100*i/float(len(aln[0]))
Upvotes: 2
Reputation: 1
I know this question is old but, since you are already in biopython, couldn't you just move along with the BLAST record class (chapter 7 of the tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html)?
I believe the option you need (under "7.4 The BLAST record class") is "hsp.identities".
Upvotes: -1
Reputation: 2202
This is a fast but not biologically accurate answer.
use Levenshtein Python extension and C library.
http://code.google.com/p/pylevenshtein/
The Levenshtein Python C extension module contains functions for fast computation of - Levenshtein (edit) distance, and edit operations - string similarity - approximate median strings, and generally string averaging - string sequence and set similarity It supports both normal and Unicode strings.
Since these sequences are strings, why not!
sudo pip install python-Levenshtein
then fire up ipython:
In [1]: import Levenshtein
In [3]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
'atggca---------gtgtggcaatcggcacat'.replace('-','')) * 100
Out[3]: 100.0
In [4]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
'atggca---------gtgtggcaatcggcacaa'.replace('-','')) * 100
Out[4]: 95.83333333333334
Upvotes: 3