Reputation: 73
I'm working with a file that is a GenBank entry (similar to this).
My goal is to extract the numbers in the CDS line, e.g.:
CDS join(1200..1401,3490..4302)
but my regex should also be able to extract the numbers from multiple lines, like this:
CDS join(1200..1401,1550..1613,1900..2010,2200..2250, 2300..2660,2800..2999,3100..3333)
I'm using this regular expression:
import re
match = re.compile(r'\w+\D+\W*(\d+)\D*')  # raw string so the backslashes reach the regex engine unchanged
result=match.findall(line)
print(result)
This gives me the correct numbers but also numbers from the rest of the file, like
gene complement(3300..4037)
So how can I change my regex to get only the numbers from the CDS entry? I am only allowed to use regex for this.
I'm going to use the numbers to print the coding part of the base sequence.
Upvotes: 0
Views: 351
Reputation: 3232
The following re pattern might work:
>>> match = re.compile(r'\s+CDS\s+\w+\([^\)]*\)')
But you'll need to call findall on the whole text body, not just one line at a time.
You can add a capturing group to grab just the numbers:
>>> match = re.compile(r'\s+CDS\s+\w+\(([^\)]*)\)')
>>> match.findall(stuff)
['1200..1401,3490..4302']  # Numbers only
Let me know if that achieves what you want!
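The capturing group above returns the whole comma-separated span as one string, so a second pass is still needed to get the individual numbers. Here is a minimal sketch of that two-step idea using only the standard re module. The sample text and the names cds_pattern, pair_pattern and ranges are made up for illustration; a real GenBank file may need a stricter pattern.

```python
import re

# Hypothetical sample; in practice this would be the whole file read into one string.
text = """
     gene            complement(3300..4037)
     CDS             join(1200..1401,1550..1613,1900..2010,2200..2250,
                     2300..2660,2800..2999,3100..3333)
"""

# Step 1: capture everything between the parentheses that follow "CDS <word>(".
# [^)]* also matches newlines, so a location that wraps onto the next line is included.
cds_pattern = re.compile(r'\s+CDS\s+\w+\(([^)]*)\)')

# Step 2: pull each start..end pair out of the captured span.
pair_pattern = re.compile(r'(\d+)\.\.(\d+)')

ranges = []
for body in cds_pattern.findall(text):
    ranges.extend((int(start), int(end)) for start, end in pair_pattern.findall(body))

print(ranges)
```

Because the first pattern only fires on CDS, the numbers on the gene line are never seen by the second pattern.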
Upvotes: 0
Reputation: 43169
You could use the heavily improved regex module by Matthew Barnett (which provides the \G functionality). With this, you could come up with the following code:
import regex as re
rx = re.compile(r"""
(?:
CDS\s+join\( # look for CDS, followed by whitespace and join(
| # OR
(?!\A)\G # make sure it's not the start of the string and \G
[.,\s]+ # followed by ., or whitespace
)
(\d+) # capture these digits
""", re.VERBOSE)
string = """
CDS join(1200..1401,1550..1613,1900..2010,2200..2250,
2300..2660,2800..2999,3100..3333)
"""
numbers = rx.findall(string)
print(numbers)
# ['1200', '1401', '1550', '1613', '1900', '2010', '2200', '2250', '2300', '2660', '2800', '2999', '3100', '3333']
\G anchors each match attempt at the position where the previous match ended, so after the initial CDS join( only digits that follow on directly are picked up.
See a demo on regex101.com (in PHP, as the emulator does not provide the same functionality for Python [it uses the original re module]).
A far inferior solution (if you are only allowed to use the re module) would be to use lookarounds:
(?<=[(.,\s])(\d+)(?=[,.)])
(?<=) is a positive lookbehind, while (?=) is a positive lookahead; see a demo for this approach on regex101.com. Be aware, though, that there might be a couple of false positives, since the pattern is not tied to the CDS line.
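To make the false-positive caveat concrete, here is a small sketch that runs the lookaround pattern with the standard re module. The two-line sample is invented for illustration.

```python
import re

# The lookaround pattern from above, unchanged.
lookaround = re.compile(r'(?<=[(.,\s])(\d+)(?=[,.)])')

# Hypothetical sample containing a gene line as well as the CDS line.
text = """
     gene            complement(3300..4037)
     CDS             join(1200..1401,3490..4302)
"""

found = lookaround.findall(text)
print(found)
# ['3300', '4037', '1200', '1401', '3490', '4302']
```

The first two values come from the gene line, so with re alone you would probably want to isolate the CDS entry first and only then extract the digits.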
Upvotes: 1