Reputation: 4279
I'm new to Python and to programming in general. I've installed BioPython in hopes that some of its components can help with a script that I'm working on. That script needs to handle many xread files, each of which contains a matrix that I need to slice in several ways. I'm hoping that there already exists a sequence datatype or class (is there a difference?) that allows indexing in the odd ways required by sequences with ambiguous characters coded in formats other than IUPAC. For example, in the sequence
2-123[01]3-22
The characters in the string literal [01] represent a single ambiguous character, either 0 or 1, in the DNA sequence represented. So the slice [-6:] should return 3[01]3-22. I haven't been able to find anything on this in the BioPython documentation, though I may have overlooked it. If there is something in BioPython that will do this, could you please point me toward the relevant documentation?
Thanks.
Upvotes: 3
Views: 498
Reputation: 65791
I'm not a BioPython expert, but you could define your own class to work the way you need. You'll need to parse it first, perhaps using regular expressions. For example:
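import re

class Sequence(list):
    def __init__(self, s):
        if isinstance(s, str):
            # Split into single characters and bracketed groups such as '[01]'.
            self.extend(re.findall(r'[^\[\]]|\[\d+\]', s))
        else:
            list.__init__(self, s)

    def __str__(self):
        # Join the tokens back into the original string form.
        return ''.join(self)

    def __getslice__(self, i, j):
        # Python 2 only: __getslice__ handles simple slices such as seq[-6:].
        l = list(self)
        return Sequence(l[i:j])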
Testing it:
In [1]: seq = Sequence('2-123[01]3-22')
It's a list inside...
In [2]: seq
Out[2]: ['2', '-', '1', '2', '3', '[01]', '3', '-', '2', '2']
But behaves like a string!
In [3]: print seq
2-123[01]3-22
In [4]: print seq[-6:]
3[01]3-22
Maybe you'll need to define some other methods to get the desired behavior.
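Note that __getslice__ only exists in Python 2. If you are on Python 3, slicing goes through __getitem__ with a slice object instead. Here is a minimal sketch of the same idea for Python 3. It is my own adaptation rather than anything from BioPython, and the TOKEN name and the default argument are just choices made for illustration:

```python
import re

# Hypothetical helper pattern (not part of BioPython): a bracketed group
# of digits such as '[01]' is one token, any other character is one token.
TOKEN = re.compile(r'\[\d+\]|[^\[\]]')


class Sequence(list):
    """A list of tokens that prints like the original string."""

    def __init__(self, s=()):
        if isinstance(s, str):
            # Split the raw string into single characters and bracketed groups.
            super().__init__(TOKEN.findall(s))
        else:
            super().__init__(s)

    def __str__(self):
        return ''.join(self)

    def __getitem__(self, index):
        result = super().__getitem__(index)
        # Slices come back as plain lists, so re-wrap them; a single
        # index returns the token string unchanged.
        if isinstance(index, slice):
            return Sequence(result)
        return result


seq = Sequence('2-123[01]3-22')
print(seq[-6:])   # 3[01]3-22
print(seq[5])     # [01]
print(len(seq))   # 10
```

Because the slice check lives in __getitem__, extended slices such as seq[::2] also come back as Sequence objects.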
Upvotes: 2