Reputation: 10996
I am not good with regular expressions and was looking at some online resources for what I would like to do. So basically, I have a regular expression in Python as follows:
import re
pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')
This is supposed to find all substrings which begin with ATG and end in TAG, TGA or TAA. I use it as:
seq = "ATGCCCTAG"
print(pattern.findall(seq))
However, this returns ATGCCC and removes the trailing TAG, and I would like it to keep the trailing TAG. How can I change it to give me the full substring?
Upvotes: 1
Views: 283
Reputation: 89567
You seem to have misunderstood what a lookahead is. A lookahead is a zero-width assertion that means "the current position in the string is followed by". In other words, it consumes nothing, since it is only a test. As a consequence, the content tested in the second lookahead is not part of capture group 1, even if you put that lookahead inside the group. Note that re.findall returns a non-empty result only because it returns the content of the capture group.
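To make the zero-width point concrete, here is a small sketch that inspects a single match of the pattern from the question. It only uses the standard re module; the variable names are illustrative.

```python
import re

# The pattern from the question: the stop codon sits in a second lookahead,
# after the end of capture group 1.
pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

match = pattern.search("ATGCCCTAG")

# The overall match is empty, because a lookahead consumes no characters.
print(repr(match.group(0)))  # ''
print(match.span())          # (0, 0)

# Only group 1 holds text, and it stops just before the stop codon.
print(match.group(1))        # ATGCCC
```

Since the pattern contains exactly one capture group, re.findall reports that group's text for each match, which is why the output is ATGCCC rather than an empty string.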
If you want to include it in the capture group 1, remove the second lookahead and put the end in the capture group:
(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))
The point of putting the whole pattern in a lookahead and a capture group is to get overlapping results. For example, ATGCCCATGCCCTAG will return ATGCCCATGCCCTAG and ATGCCCTAG. If you remove the lookahead, you will only obtain ATGCCCATGCCCTAG.
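The difference can be checked with a short sketch that runs both variants on the example string above. The names overlapping and consuming are only illustrative.

```python
import re

seq = "ATGCCCATGCCCTAG"

# Wrapped in a lookahead: each match is empty, so the scan advances one
# character at a time and can report a result that starts inside an earlier one.
overlapping = re.compile(r'(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))')
print(overlapping.findall(seq))  # ['ATGCCCATGCCCTAG', 'ATGCCCTAG']

# Without the lookahead, the first match consumes the whole string,
# so the inner ATG is never tried as a starting point.
consuming = re.compile(r'ATG(?:...)*?(?:TAG|TGA|TAA)')
print(consuming.findall(seq))    # ['ATGCCCATGCCCTAG']
```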
Upvotes: 2
Reputation: 98921
You may want to use a simpler regex without lookahead, i.e.:
re.compile("ATG(?:...).*?(?:TAG|TGA|TAA)")
DEMO:
https://regex101.com/r/qI4fV0/3
EXPLANATION:
ATG(?:...).*?(?:TAG|TGA|TAA)
ATG matches the characters ATG literally (case sensitive)
(?:...) Non-capturing group
    . matches any character (except newline)
    . matches any character (except newline)
    . matches any character (except newline)
.*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
(?:TAG|TGA|TAA) Non-capturing group
    1st Alternative: TAG
        TAG matches the characters TAG literally (case sensitive)
    2nd Alternative: TGA
        TGA matches the characters TGA literally (case sensitive)
    3rd Alternative: TAA
        TAA matches the characters TAA literally (case sensitive)
Upvotes: 2
Reputation: 626926
To find all substrings which begin with ATG and end in TAG, TGA or TAA, you will need:
ATG(?:...)*?(?:TAG|TGA|TAA)
This regex also makes sure there are 0 or more 3-symbol (excl. newline) sequences in-between ATG and the last TAG, TGA or TAA.
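The effect of the 3-symbol grouping shows up on input where an ending triplet appears out of step with ATG. The sketch below compares this regex with the .*? variant from the other answer; the test string is made up for illustration and the names in_frame and any_gap are hypothetical.

```python
import re

# Made-up input: TAG occurs at index 7 (not a multiple of 3 from ATG),
# and TAA occurs at index 12 (a multiple of 3 from ATG).
seq = "ATGCCCCTAGCCTAA"

# Steps through the string three symbols at a time after ATG.
in_frame = re.compile(r'ATG(?:...)*?(?:TAG|TGA|TAA)')
# Allows any number of symbols after the first three.
any_gap = re.compile(r'ATG(?:...).*?(?:TAG|TGA|TAA)')

print(in_frame.findall(seq))  # ['ATGCCCCTAGCCTAA']
print(any_gap.findall(seq))   # ['ATGCCCCTAG']
```

The (?:...)*? version skips the TAG at index 7 because reaching it would require a gap of four symbols, while the .*? version stops there.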
import re
p = re.compile(r'ATG(?:...)*?(?:TAG|TGA|TAA)')
test_str = "FFG FFG ATGCCCTAG"
print (p.findall(test_str))
This will work if you need to find non-overlapping substrings. To find overlapping ones, the technique is to encapsulate that into a capturing group and place in a non-anchored positive look-ahead:
r'(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))'
  |  |                           ||
  |  | --- Capture group ------- ||
  | -- Positive look-ahead ------ |
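If the positions of the overlapping substrings are needed as well, one possible approach is re.finditer with the span of group 1. This is a sketch under the assumption that start and end offsets are wanted; the input string and the name hits are illustrative.

```python
import re

pattern = re.compile(r'(?=(ATG(?:...)*?(?:TAG|TGA|TAA)))')
seq = "ATGCCCATGCCCTAG"  # illustrative input

# m.start(1) and m.end(1) give the span of the captured text.
# m.end() without an argument would equal m.start(), since the match is empty.
hits = [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(seq)]
print(hits)  # [(0, 15, 'ATGCCCATGCCCTAG'), (6, 15, 'ATGCCCTAG')]
```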
Upvotes: 3