Find overlapping matches and submatches using regular expressions in Python

Question

I have a string of characters (a DNA sequence) with a regular expression I designed to filter out possible matches, (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG). Later I apply two filter conditions:

The sequence must be divisible by 3, len(match) % 3 == 0, and
There must be no stop codon (['AGA', 'AGG', 'TAA', 'TAG']) before the end of the string, not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]).

However, when I apply these filters, I get no matches at all (before the filters I get around ~200 matches. The way I'm searching for substrings in the full sequence is by running re.findall(regex, genome_fasta, overlapped=True), because matches could be submatches of other matches.

Is there something about regular expressions that I'm misunderstanding? To my knowledge, after the filters I should still have matches.

If there's anything else I need to add please let me know! (I'm using the regex package for Python 3.4, not the standard re package, because it has no overlap support).

EDIT 1:

Per comment: I'm looking for ORFs in the mitochondrial genome, but only considering those at least 150 nucleotides (characters) in length. Considering overlap is important because a match could include the first start codon in the string and the last stop codon in the string, but there could be another start codon in the middle. For example:

Given "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG", matches should be "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG" but also "ATAAAGCCATTTACCGTACATAGCACATTATAA", since both "TAG" and "TAA" are stop codons.

EDIT 2:

Totally, misunderstood comment, full code for method is:

typical_regex = r"%s[ATCGN]{%s,%s}%s" % (proc_start_codons, str(minimum_orf_length - 6), str(maximum_orf_length - 6), proc_stop_codons)

typical_fwd_matches = []
    if re.search(typical_regex, genome_fasta, overlapped=True):
        for match in re.findall(typical_regex, genome_fasta, overlapped=True):
            if len(match) % 3 == 0:
                if not any(substring in list(sliced(match[:-3], 3)) for substring in [x.replace("U", "T") for x in stop_codons]):
                    typical_fwd_matches.append(match)

print(typical_fwd_matches)

The typical_fwd_matches array is empty and the regex is rendered as (?:ATA|ATT)[ATCGN]{144,16563}(?:AGA|AGG|TAA|TAG) when printed to console/file.

user557597 · Accepted Answer

I think you can do it this way.
The subsets will consist of ever decreasing size of the previous matches.
That's about all you have to do.
So, it's fairly straight forward to design the regex.

The regex will only match multiples of 3 chars.
The beginning and middle are captured in group 1.
This is used for the new text value which is just the last match
minus the last 3 chars.

Regex explained:

 (                             # (1 start), Whole match minus last 3 chars
      (?: ATA | ATT )               # Starts with one of these 3 char sequence
      (?:                           # Cluster group
           [ATCGN]{3}                    # Any 3 char sequence consisting of these chars
      )+                            # End cluster, do 1 to many times
 )                             # (1 end)
 (?: AGA | AGG | TAA | TAG )   # Last 3 char sequence, one of these

Python code sample:

Demo

import re

r = re.compile(r"((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)")
text = "ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG"

m = r.search(text)
while m:
    print("Found:  " + m.group(0))
    text = m.group(1)
    m = r.search(text)

Output:

Found:  ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
Found:  ATAAAGCCATTTACCGTACATAGCACATTATAA
Found:  ATTTACCGTACATAG

Using this method, the subsets being tested are these:

ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAACTAG
ATAAAGCCATTTACCGTACATAGCACATTATAACCAACAAACCTACCCACCCTTAAC
ATAAAGCCATTTACCGTACATAGCACATTA
ATTTACCGTACA

We can benchmark the time the regex takes to match these.

Regex1:   ((?:ATA|ATT)(?:[ATCGN]{3})+)(?:AGA|AGG|TAA|TAG)
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   3
Elapsed Time:    1.63 s,   1627.59 ms,   1627594 µs
Matches per sec:   92,160

Find overlapping matches and submatches using regular expressions in Python

Answers (1)

Related Questions