Gliz

Reputation: 73

Python: Regular Expressions on getting repeating set of numbers

I'm working with a file that is a GenBank entry (similar to this).

My goal is to extract the numbers in the CDS line, e.g.:

    CDS             join(1200..1401,3490..4302)

but my regex should also be able to extract the numbers from multiple lines, like this:

     CDS            join(1200..1401,1550..1613,1900..2010,2200..2250,
                 2300..2660,2800..2999,3100..3333)

I'm using this regular expression:

     import re
     match = re.compile(r'\w+\D+\W*(\d+)\D*')
     result=match.findall(line)
     print(result)

This gives me the correct numbers, but also numbers from the rest of the file, such as:

     gene            complement(3300..4037)

So how can I change my regex to get only the numbers from the CDS entry? I'm supposed to use only regex for this.

I'm going to use the numbers to print the coding part of the base sequence.

Upvotes: 0

Views: 351

Answers (2)

DaveBensonPhillips

Reputation: 3232

The following re pattern might work:

>>> match = re.compile(r'\s+CDS\s+\w+\([^\)]*\)')

But you'll need to call findall on the whole text body, not just a line at a time.

You can add a capturing group (parentheses) to grab just the numbers:

>>> match = re.compile(r'\s+CDS\s+\w+\(([^\)]*)\)')
>>> match.findall(stuff)
['1200..1401,3490..4302']       # Numbers only
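
If the individual integers are needed rather than one comma-separated string, a second pass with `\d+` over each captured group is one option. Here is a minimal sketch using only the standard `re` module. The sample text and the names `text`, `cds_pattern` and `cds_numbers` are made up for illustration:

```python
import re

# Hypothetical sample standing in for the contents of the GenBank file.
text = """
     gene            complement(3300..4037)
     CDS             join(1200..1401,1550..1613,1900..2010,2200..2250,
                     2300..2660,2800..2999,3100..3333)
"""

# Step 1: capture everything between the parentheses of each CDS feature.
# [^)]* also matches newlines, so a location wrapped over two lines is fine.
cds_pattern = re.compile(r'\s+CDS\s+\w+\(([^)]*)\)')

# Step 2: pull the integers out of each captured location string.
cds_numbers = [
    [int(n) for n in re.findall(r'\d+', location)]
    for location in cds_pattern.findall(text)
]

print(cds_numbers)
# [[1200, 1401, 1550, 1613, 1900, 2010, 2200, 2250, 2300, 2660, 2800, 2999, 3100, 3333]]
```

The `gene` line is skipped because the pattern requires the literal `CDS` keyword. Note that this assumes the location is wrapped in an operator such as `join(...)`; a bare location like `CDS 1200..1401` would not match and would need a different pattern.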

Let me know if that achieves what you want!

Upvotes: 0

Jan

Reputation: 43169

You could use the heavily improved regex module by Matthew Barnett (which provides the \G functionality). With this, you could come up with the following code:

import regex as re
rx = re.compile(r"""
            (?:
                CDS\s+join\(    # look for CDS, followed by whitespace and join(
                |               # OR
                (?!\A)\G        # make sure it's not the start of the string and \G 
                [.,\s]+         # followed by ., or whitespace
            )
            (\d+)               # capture these digits
                """, re.VERBOSE)

string = """
         CDS            join(1200..1401,1550..1613,1900..2010,2200..2250,
                     2300..2660,2800..2999,3100..3333)
"""

numbers = rx.findall(string)
print(numbers)
# ['1200', '1401', '1550', '1613', '1900', '2010', '2200', '2250', '2300', '2660', '2800', '2999', '3100', '3333']

\G makes sure the regex engine looks for the next match at the end of the last match.
See a demo on regex101.com (in PHP as the emulator does not provide the same functionality for Python [it uses the original re module]).

A far inferior solution (if you are only allowed to use the re module) would be to use lookarounds:

(?<=[(.,\s])(\d+)(?=[,.)])

(?<=...) is a positive lookbehind, while (?=...) is a positive lookahead; see a demo for this approach on regex101.com. Be aware, though, that there might be a couple of false positives.
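
To illustrate the false-positive caveat, here is a small sketch that runs the lookaround pattern with the standard `re` module on a made-up sample (the `sample` text and the name `lookaround` are assumptions for illustration). Because the pattern does not anchor on `CDS`, it also picks up the numbers from the `gene` line:

```python
import re

# Hypothetical sample: a gene feature followed by the CDS feature we want.
sample = """
     gene            complement(3300..4037)
     CDS             join(1200..1401,3490..4302)
"""

# Digits preceded by "(", ".", "," or whitespace and followed by ",", "." or ")".
lookaround = re.compile(r'(?<=[(.,\s])(\d+)(?=[,.)])')

print(lookaround.findall(sample))
# ['3300', '4037', '1200', '1401', '3490', '4302']
```

The first two values come from the `gene` line, so this approach is only safe if the input has already been narrowed down to the CDS entry.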

Upvotes: 1
