Luca

Reputation: 10996

Get the full substring matching the regex pattern

I am not good with regular expressions and was looking at some online resources for what I would like to do. So basically, I have a regular expression in Python as follows:

import re
pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

This is supposed to find all substrings which begin with ATG and end in TAG, TGA, or TAA. I use it as:

seq = "ATGCCCTAG"  # renamed from str to avoid shadowing the built-in
print(pattern.findall(seq))

However, this returns ATGCCC without the trailing TAG, and I would like to keep the trailing TAG. How can I change the pattern so that it gives me the full substring?

Upvotes: 1

Views: 283

Answers (3)

Casimir et Hippolyte

Reputation: 89567

You seem to have misunderstood what a lookahead is. A lookahead is a zero-width assertion that means "the current position in the string is followed by ..."; in other words, it consumes no characters because it is only a test. As a consequence, the content tested in the second lookahead is not part of capture group 1, and it would not become part of it even if you moved that lookahead inside the group. Note that re.findall returns a non-empty result only because, when a pattern contains a capture group, it returns the group's content rather than the (empty) overall match.
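A minimal sketch of that zero-width behaviour, using the pattern from the question (the variable names are only illustrative):

```python
import re

# Pattern from the question: everything sits inside a lookahead.
pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

m = pattern.search("ATGCCCTAG")
whole_match = m.group(0)  # text consumed by the match itself
group_one = m.group(1)    # text captured inside the lookahead
span = m.span()           # start and end of the overall match

print(repr(whole_match), repr(group_one), span)
```

The overall match is empty and spans (0, 0), while group 1 holds ATGCCC, which is what re.findall reports.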

If you want the terminating TAG, TGA, or TAA to be included in capture group 1, remove the second lookahead and match the ending inside the capture group:

(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))

The point of putting the whole pattern in a lookahead and a capture group is to get overlapping results. For example, ATGCCCATGCCCTAG will return ATGCCCATGCCCTAG and ATGCCCTAG.

If you remove the lookahead, you will only obtain ATGCCCATGCCCTAG.
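The difference can be checked with a short sketch along these lines (the names seq, overlapping, and non_overlapping are chosen here for illustration):

```python
import re

seq = "ATGCCCATGCCCTAG"

# Wrapped in a lookahead: each match consumes nothing, so the scan
# resumes one character later and can report overlapping hits.
overlapping = re.findall(r'(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))', seq)

# Plain pattern: the first match consumes the whole string,
# so the inner ATGCCCTAG is never reported.
non_overlapping = re.findall(r'ATG(?:...)*?(?:TAG|TGA|TAA)', seq)

print(overlapping)      # ['ATGCCCATGCCCTAG', 'ATGCCCTAG']
print(non_overlapping)  # ['ATGCCCATGCCCTAG']
```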

Upvotes: 2

Pedro Lobito

Reputation: 98921

You may want to use a simpler regex without lookahead, i.e.:

re.compile("ATG(?:...).*?(?:TAG|TGA|TAA)")

DEMO:

https://regex101.com/r/qI4fV0/3


EXPLANATION:

ATG(?:...).*?(?:TAG|TGA|TAA)

ATG matches the characters ATG literally (case sensitive)
(?:...) Non-capturing group
    . matches any character (except newline)
    . matches any character (except newline)
    . matches any character (except newline)
.*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
(?:TAG|TGA|TAA) Non-capturing group
    1st Alternative: TAG
        TAG matches the characters TAG literally (case sensitive)
    2nd Alternative: TGA
        TGA matches the characters TGA literally (case sensitive)
    3rd Alternative: TAA
        TAA matches the characters TAA literally (case sensitive)
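One thing worth checking before adopting this pattern: because .*? accepts any number of characters, the text between ATG and the ending does not have to be a multiple of three, and at least three characters are required after ATG. The sketch below compares it with the triplet-only pattern from the other answers; the sample strings are made up for illustration.

```python
import re

# Pattern from this answer: three characters, then anything (lazy).
any_length = re.compile(r'ATG(?:...).*?(?:TAG|TGA|TAA)')
# Pattern from the other answers: zero or more whole triplets (lazy).
codon_aligned = re.compile(r'ATG(?:...)*?(?:TAG|TGA|TAA)')

samples = ["ATGCCCTAG", "ATGCCCCTAG", "ATGTAG"]
results = {s: (any_length.findall(s), codon_aligned.findall(s)) for s in samples}

for s, (loose, strict) in results.items():
    print(s, loose, strict)
```

On these samples the two patterns agree only for ATGCCCTAG: the first pattern also accepts ATGCCCCTAG (four characters in between) and rejects ATGTAG, while the triplet-only pattern does the opposite.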

Upvotes: 2

Wiktor Stribiżew

Reputation: 626926

To find all substrings which begin with ATG and end in TAG, TGA, or TAA, you will need

ATG(?:...)*?(?:TAG|TGA|TAA)

This regex also ensures that only zero or more 3-character sequences (any characters except a newline) appear between ATG and the closing TAG, TGA, or TAA.


Python demo:

import re
# ATG, zero or more triplets (lazy), then TAG, TGA, or TAA
p = re.compile(r'ATG(?:...)*?(?:TAG|TGA|TAA)')
test_str = "FFG FFG ATGCCCTAG"
print(p.findall(test_str))  # ['ATGCCCTAG']

This will work if you need to find non-overlapping substrings. To find overlapping ones, the technique is to wrap the pattern in a capturing group and place that group inside an unanchored positive lookahead:

r'(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))'
  |  |                           ||
  |  | --- Capture group ------- ||   
  | -- Positive look-ahead ------ |
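As a rough sketch of how the lookahead version might be used when the position of each hit matters, re.finditer exposes both the start offset and the captured group. The test string below is invented for illustration.

```python
import re

p = re.compile(r'(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))')
test_str = "FFG ATGATGCCCTAGTAA"

# Each match is zero-width, so m.start() is where the captured
# substring begins and m.group(1) is the substring itself.
hits = [(m.start(), m.group(1)) for m in p.finditer(test_str)]

print(hits)  # [(4, 'ATGATGCCCTAG'), (7, 'ATGCCCTAG')]
```

Both hits end at the same TAG, which a plain non-overlapping search would report only once.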


Upvotes: 3
