Reputation: 13487
I have a string from which I need to extract certain values, for example "FEFEWFSTARTFFFPENDDCDC". How could I write an expression that takes a slice from "START" all the way to "END"?
I previously tried writing functions that used a for loop and string.find("START") to locate the beginning and end, but this didn't work well and seemed overly complex. Is there an easier way to do this without complex loops?
EDIT:
Forgot this part. What if there were different end values? In other words, instead of ending only with "END", the values "DONE" and "NOMORE" would also end it? In addition, there may be multiple starts and ends throughout the string, for example "STARTFFEFFDONEFEWFSTARTFEFFENDDDW".
EDIT2: Sample run. Start value: ATG. End values: TAG, TAA, TGA.
"Enter a string": TTATGTTTTAAGGATGGGGCGTTAGTT
TTT
GGGCGT
And
"Enter a string": TGTGTGTATAT
"No string found"
Upvotes: 1
Views: 2519
Reputation: 336128
That's a perfect fit for a regular expression:
>>> import re
>>> s = "FEFEWFSTARTFFFPENDDCDCSTARTDOINVOIJHSDFDONEDFOIER"
>>> re.findall("START.*?(?:END|DONE|NOMORE)", s)
['STARTFFFPEND', 'STARTDOINVOIJHSDFDONE']
.* matches any number of characters (except newlines). The additional ? makes the quantifier lazy, telling it to match as few characters as possible. Otherwise, there would be only one match, namely STARTFFFPENDDCDCSTARTDOINVOIJHSDFDONE.
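To see the difference between the lazy and the greedy quantifier side by side, here is a small sketch on the same string. The variable names (end_markers, lazy, greedy) are only illustrative.

```python
import re

s = "FEFEWFSTARTFFFPENDDCDCSTARTDOINVOIJHSDFDONEDFOIER"
end_markers = "(?:END|DONE|NOMORE)"

# Lazy: stop at the first end marker after each START.
lazy = re.findall("START.*?" + end_markers, s)

# Greedy: run as far as possible, then back off to the last end marker.
greedy = re.findall("START.*" + end_markers, s)

print(lazy)    # ['STARTFFFPEND', 'STARTDOINVOIJHSDFDONE']
print(greedy)  # ['STARTFFFPENDDCDCSTARTDOINVOIJHSDFDONE']
```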
As @BurhanKhalid noted, if you add a capturing group, re.findall returns only the substring matched by that group instead of the whole match:
>>> re.findall("START(.*?)(?:END|DONE|NOMORE)", s)
['FFFP', 'DOINVOIJHSDF']
Explanation:
START      # Match "START"
(          # Match and capture in group number 1:
  .*?      # Any character, any number of times, as few as possible
)          # End of capturing group 1
(?:        # Start a non-capturing group that matches...
  END      # "END"
  |        # or
  DONE     # "DONE"
  |        # or
  NOMORE   # "NOMORE"
)          # End of non-capturing group
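If you want to keep such comments next to the pattern in real code, the re.VERBOSE flag lets you write the regex in this commented layout. This is a minimal sketch. The name pattern is an assumption, and the second test string is the one from the question's first edit.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
pattern = re.compile(r"""
    START                 # Match "START"
    (.*?)                 # Capture as few characters as possible
    (?:END|DONE|NOMORE)   # Any one of the end markers (not captured)
""", re.VERBOSE)

first = pattern.findall("FEFEWFSTARTFFFPENDDCDCSTARTDOINVOIJHSDFDONEDFOIER")
second = pattern.findall("STARTFFEFFDONEFEWFSTARTFEFFENDDDW")

print(first)   # ['FFFP', 'DOINVOIJHSDF']
print(second)  # ['FFEFF', 'FEFF']
```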
And if your real goal is to match gene sequences, you need to make sure that you always match triplets:
re.findall("ATG(?:.{3})*?(?:TA[AG]|TGA)", s)
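To reproduce the sample run from the question, one option is to put the triplets in a capturing group so that only the part between the start and stop codons is returned. This is a sketch under that assumption. The names CODING_REGION and find_coding_regions are made up for illustration.

```python
import re

# Same idea as the regex above, with a capturing group around the triplets
# so that findall returns only what lies between ATG and the stop codon.
CODING_REGION = re.compile(r"ATG((?:.{3})*?)(?:TAG|TAA|TGA)")

def find_coding_regions(dna):
    """Return the triplet runs between each ATG and the next in-frame stop codon."""
    return CODING_REGION.findall(dna)

for dna in ("TTATGTTTTAAGGATGGGGCGTTAGTT", "TGTGTGTATAT"):
    regions = find_coding_regions(dna)
    if regions:
        for region in regions:
            print(region)
    else:
        print("No string found")
```

For the first string this prints TTT and GGGCGT, and for the second it prints "No string found", matching the sample run.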
Upvotes: 5
Reputation: 5275
Not the most efficient approach, but it works when both markers are present (str.index raises ValueError otherwise):
>>> s = "FEFEWFSTARTFFFPENDDCDC"
>>> s[s.index('START'):s.index('END')+len('END')]
'STARTFFFPEND'
Upvotes: 1
Reputation: 466
yourString = 'FEFEWFSTARTFFFPENDDCDC'
# Slice from just after "START" up to (not including) "END"
substring = yourString[yourString.find("START") + len("START") : yourString.find("END")]
Upvotes: 1
Reputation: 6861
The simple way (no loop, no regex):
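s = "FEFEWFSTARTFFFPENDDCDC"
# Drop everything up to and including the first "START"
tmp = s[s.find("START") + len("START"):]
# Keep everything before the first "END" that follows
result = tmp[:tmp.find("END")]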
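Note that str.find returns -1 when a marker is missing, so the slices above quietly give a wrong result in that case. A possible variant, sketched here with str.partition, returns None instead. The function name between and its default arguments are assumptions for illustration, and like the code above it only handles the first START/END pair.

```python
def between(s, start="START", end="END"):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition returns (before, separator, after); separator is '' if not found.
    _, found_start, rest = s.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return middle

print(between("FEFEWFSTARTFFFPENDDCDC"))  # FFFP
print(between("FEFEWF"))                  # None
```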
Upvotes: 1