How to parse ambiguous characters with BioPython

Question

I'm new to Python and to programming in general. I've installed BioPython in the hope that some of its components can help with a script I'm working on. The script needs to handle many xread files, each of which contains a matrix that I need to slice in several ways. I'm hoping there is already a sequence datatype or class (is there a difference?) that supports the unusual indexing required by sequences whose ambiguous characters are coded in formats other than IUPAC. For example, in the sequence:

2-123[01]3-22

The characters in the string literal [01] represent a single ambiguous character, either 0 or 1, in the DNA sequence represented. So the slice [-6:] should return 3[01]3-22. I haven't been able to find anything on this in the BioPython documentation, though I may have overlooked it. If there is something in BioPython that will do this, could you please point me toward the relevant documentation?

Thanks.

Lev Levitsky · Accepted Answer

I'm not a BioPython expert, but you could define your own class to work the way you need. You'll need to parse it first, perhaps using regular expressions. For example:

import re
class Sequence(list):
    def __init__(self, s):
        if isinstance(s, str):
            self.extend(re.findall(r'[^\[]|\[\d+\]', s))
        else:
            list.__init__(self, s)
    def __str__(self):
        return ''.join(self)
    def __getslice__(self, i, j):
        l = list(self)
        return Sequence(l[i:j])

Testing it:

In [1]: seq = Sequence('2-123[01]3-22')

It's a list inside...

In [2]: seq
Out[2]: ['2', '-', '1', '2', '3', '[01]', '3', '-', '2', '2']

But behaves like a string!

In [3]: print seq
2-123[01]3-22
In [4]: print seq[-6:]
3[01]3-22

Maybe you'll need to define some other methods to get the desired behavior.
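
One such method, if you are on Python 3: `__getslice__` is no longer called there, so slices would come back as plain lists and print as lists. A minimal sketch of a variant that overrides `__getitem__` instead is below. The class name `AmbiguousSequence` is just an illustrative name, not anything from BioPython, and the sketch assumes ambiguous groups are always digits inside square brackets, as in the question.

```python
import re


class AmbiguousSequence(list):
    """List of tokens in which a bracketed group such as '[01]' is one item."""

    # Try a bracketed group of digits first, otherwise take any single
    # character that is not '['. A stray '[' that does not start a valid
    # group is silently skipped (an assumption of this sketch).
    _TOKEN = re.compile(r'\[\d+\]|[^\[]')

    def __init__(self, s=()):
        if isinstance(s, str):
            super().__init__(self._TOKEN.findall(s))
        else:
            super().__init__(s)

    def __str__(self):
        return ''.join(self)

    def __getitem__(self, index):
        result = super().__getitem__(index)
        # list slicing returns a plain list; wrap it so the slice still
        # prints like a string.
        if isinstance(index, slice):
            return type(self)(result)
        return result


seq = AmbiguousSequence('2-123[01]3-22')
print(len(seq))    # 10
print(seq[-6:])    # 3[01]3-22
print(seq[5])      # [01]
```

Indexing with a single integer still returns the plain token string, so `seq[5]` gives `'[01]'` and you can inspect the alternatives yourself.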
