Reputation: 47
I have a string s = "ATAATGCGTGGAATTATGACCGGAATC". I would like to extract all substrings starting with ATG and ending with GGA. So the results would be ATGCGTGGA and ATGACCGGA.
This is what I have tried so far, but it is not working. Thanks in advance for your help.
import re
s = "ATAATGCGTGGAATTATGACCGGAATC"
x = re.findall('^ATG.+GGA$', s)
print(x)
Upvotes: 0
Views: 53
Reputation: 15525
The symbols ^ and $ refer to the beginning and end of the string, not the beginning and end of the substring.
Just remove ^ and $ from your regexp: re.findall('ATG.+GGA', s).
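To make the effect of the anchors concrete, here is a small sketch, assuming Python 3 and the string from the question (the variable names anchored and greedy are only illustrative). It compares the anchored pattern with the unanchored one. Note that the unanchored pattern finds a match, but it is a single long one rather than the two substrings asked for.

```python
import re

s = "ATAATGCGTGGAATTATGACCGGAATC"

# Anchored: the whole string would have to start with ATG and end with GGA,
# which it does not, so nothing is found.
anchored = re.findall(r'^ATG.+GGA$', s)
print(anchored)  # []

# Unanchored but greedy: .+ consumes as much as possible, so the match
# runs from the first ATG to the last GGA in the string.
greedy = re.findall(r'ATG.+GGA', s)
print(greedy)  # ['ATGCGTGGAATTATGACCGGA']
```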
In addition, you will want to add ? after the +, so that the match stops at the first GGA found rather than the last: re.findall('ATG.+?GGA', s)
Refer to "Module re: regular expression syntax" in the official Python documentation for more information about ^, $ and ?.
Upvotes: 1
Reputation: 1957
With ^ and $ you are anchoring the pattern to the start and end of the string (or of each line, if re.MULTILINE is set); don't do that if you want to find substrings. Also, by default regex quantifiers are "greedy": they match the longest possible sequence.
You need to use +? for a non-greedy (aka lazy) match that matches the shortest sequences:
x = re.findall('ATG.+?GGA', s)
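As a quick check, here is a self-contained sketch, again assuming the string from the question. It runs the lazy pattern and, as an optional extra, uses re.finditer to report where each match sits. The names lazy and spans are only illustrative.

```python
import re

s = "ATAATGCGTGGAATTATGACCGGAATC"

# Lazy quantifier: each match ends at the first GGA after an ATG.
lazy = re.findall(r'ATG.+?GGA', s)
print(lazy)  # ['ATGCGTGGA', 'ATGACCGGA']

# finditer yields match objects, which also carry the start/end offsets.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r'ATG.+?GGA', s)]
print(spans)  # [(3, 12, 'ATGCGTGGA'), (15, 24, 'ATGACCGGA')]
```

Note that .+? requires at least one character between ATG and GGA; if an empty middle (ATGGGA) should also count, .*? would be the variant to try.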
Upvotes: 2