ptalebic

Reputation: 47

How to extract substrings from a string using regular expression

I have a string s = "ATAATGCGTGGAATTATGACCGGAATC". I would like to extract all substrings that start with ATG and end with GGA, so the result should be ATGCGTGGA and ATGACCGGA.

This is what I have tried so far, but it does not work. Thanks in advance for your help.


import re

s = "ATAATGCGTGGAATTATGACCGGAATC"
x = re.findall('^ATG.+GGA$', s)
print(x)

Upvotes: 0

Views: 53

Answers (2)

Stef

Reputation: 15525

The symbols ^ and $ anchor the match to the beginning and end of the whole string, not to the beginning and end of each substring.

Just remove ^ and $ from your regexp: re.findall('ATG.+GGA', s).

In addition, you might want to add ? after the +, to stop at the first GGA found rather than the last: re.findall('ATG.+?GGA', s)
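
To make the difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The variable names greedy and lazy are just illustrative.

```python
import re

s = "ATAATGCGTGGAATTATGACCGGAATC"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST GGA.
greedy = re.findall(r"ATG.+GGA", s)

# Lazy: .+? grabs as little as possible, so each match stops at the FIRST GGA.
lazy = re.findall(r"ATG.+?GGA", s)

print(greedy)  # ['ATGCGTGGAATTATGACCGGA']
print(lazy)    # ['ATGCGTGGA', 'ATGACCGGA']
```

The greedy version returns a single long match that swallows both substrings, which is why the ? is needed here.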

Refer to the regular expression syntax section of the re module in the official Python documentation for more information about ^, $ and ?.

Upvotes: 1

vaizki

Reputation: 1957

With ^ and $ you are anchoring the match to the start and end of the string (or of each line, if you pass re.MULTILINE). Don't do that if you want to find substrings. Also, by default a regex quantifier is "greedy": it will match the longest possible sequence.

You need to use +? for a non-greedy (aka lazy) match that matches the shortest sequences:

x = re.findall('ATG.+?GGA', s)
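
As a rough illustration of both points, the sketch below first shows that the anchored pattern from the question finds nothing (the string starts with ATA, not ATG), then uses re.finditer with the lazy pattern to also report where each match sits. The names anchored and hits are my own, not from the question.

```python
import re

s = "ATAATGCGTGGAATTATGACCGGAATC"

# Anchored pattern from the question: the whole string would have to
# start with ATG and end with GGA, so nothing is found.
anchored = re.findall(r"^ATG.+GGA$", s)
print(anchored)  # []

# Unanchored, lazy pattern: collect (start, end, text) for every match.
hits = [(m.start(), m.end(), m.group()) for m in re.finditer(r"ATG.+?GGA", s)]
print(hits)  # [(3, 12, 'ATGCGTGGA'), (15, 24, 'ATGACCGGA')]
```

Note that findall and finditer return non-overlapping matches, scanning left to right.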

Upvotes: 2
