Reputation: 1098

Calculate percent identity between two already-aligned sequences

I have two strings in a file like this:

>1
atggca---------gtgtggcaatcggcacat
>2
atggca---------gtgtggcaatcggcacat

Using the AlignIO function in Biopython:

from Bio import AlignIO
print AlignIO.read("neighbor.fas", "fasta")

returns this:

SingleLetterAlphabet() alignment with 2 rows and 33 columns
atggca---------gtgtggcaatcggcacat 1
atggca---------gtgtggcaatcggcacat 2

I want to calculate the percentage identity between the two rows in this alignment.

row = align[:,n]

allows for the extraction of individual columns that can be compared.

Columns that contain only "-" should not be counted.

Upvotes: 2

Answers (3)

Teri Forey

Reputation: 59

If you wanted to extend it to more than two sequences the following works well, it gives the same result as the BioPerl overall_percentage_identity function (http://search.cpan.org/dist/BioPerl/Bio/SimpleAlign.pm).

from Bio import AlignIO

align = AlignIO.read("neighbor.fas", "fasta")
print perc_identity(align)

def perc_identity(aln):
    i = 0
    for a in range(0,len(aln[0])):
        s = aln[:,a]
        if s == len(s) * s[0]:
            i += 1
    return 100*i/float(len(aln[0]))

Upvotes: 2

Cat

Reputation: 1

I know this question is old but, since you are already in biopython, couldn't you just move along with the BLAST record class (chapter 7 of the tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html)?

I believe the option you need (under "7.4 The BLAST record class") is "hsp.identities".

Upvotes: -1

Tim

Reputation: 2202

This is a fast but not biologically accurate answer.

use Levenshtein Python extension and C library.

http://code.google.com/p/pylevenshtein/

The Levenshtein Python C extension module contains functions for fast computation of - Levenshtein (edit) distance, and edit operations - string similarity - approximate median strings, and generally string averaging - string sequence and set similarity It supports both normal and Unicode strings.

Since these sequences are strings, why not!

sudo pip install python-Levenshtein

then fire up ipython:

In [1]: import Levenshtein

In [3]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacat'.replace('-','')) * 100
Out[3]: 100.0

In [4]: Levenshtein.ratio('atggca---------gtgtggcaatcggcacat'.replace('-',''),
                          'atggca---------gtgtggcaatcggcacaa'.replace('-','')) * 100
Out[4]: 95.83333333333334

Upvotes: 3

Calculate percent identity between two already-aligned sequences

Answers (3)

Related Questions