Reputation: 13487

Slicing by start and stop string values in Python

I have a string from which I need to extract certain values. For example: "FEFEWFSTARTFFFPENDDCDC". How could I write an expression that takes a slice from "START" all the way to "END"?

I previously tried writing functions that used a for loop and string.find("START") to locate the beginning and end, but this didn't work reliably and seemed overly complex. Is there an easier way to do this without complex loops?

EDIT:

I forgot this part: what if there were different end values? In other words, instead of ending only with "END", the values "DONE" and "NOMORE" would also end a slice. In addition, there can be multiple starts and ends throughout the string. For example: "STARTFFEFFDONEFEWFSTARTFEFFENDDDW".

EDIT2: Sample run: Start value: ATG. End values: TAG,TAA,TGA

"Enter a string": TTATGTTTTAAGGATGGGGCGTTAGTT
TTT
GGGCGT

And

"Enter a string": TGTGTGTATAT
"No string found"

Upvotes: 1

Answers (5)

Tim Pietzcker

Reputation: 336128

That's a perfect fit for a regular expression:

>>> import re
>>> s = "FEFEWFSTARTFFFPENDDCDCSTARTDOINVOIJHSDFDONEDFOIER"
>>> re.findall("START.*?(?:END|DONE|NOMORE)", s)
['STARTFFFPEND', 'STARTDOINVOIJHSDFDONE']

.* matches any number of characters (except newlines); the additional ? makes the quantifier lazy, telling it to match as few characters as possible. Without it, there would be only one match, namely STARTFFFPENDDCDCSTARTDOINVOIJHSDFDONE.
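To see the difference side by side, here is a small sketch that runs the greedy and the lazy version of the pattern on the same string. The variable names (END_MARKERS, greedy, lazy) are only illustrative.

```python
import re

s = "FEFEWFSTARTFFFPENDDCDCSTARTDOINVOIJHSDFDONEDFOIER"
END_MARKERS = "(?:END|DONE|NOMORE)"

# Greedy: .* grabs as much as it can, then backtracks to the LAST end marker.
greedy = re.findall("START.*" + END_MARKERS, s)

# Lazy: .*? stops at the FIRST end marker after each START.
lazy = re.findall("START.*?" + END_MARKERS, s)

print(greedy)  # ['STARTFFFPENDDCDCSTARTDOINVOIJHSDFDONE']
print(lazy)    # ['STARTFFFPEND', 'STARTDOINVOIJHSDFDONE']
```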

As @BurhanKhalid noted, if you add a capturing group, only the substring matched by that part of the regex will be captured:

>>> re.findall("START(.*?)(?:END|DONE|NOMORE)", s)
['FFFP', 'DOINVOIJHSDF']

Explanation:

START    # Match "START"
(        # Match and capture in group number 1:
 .*?     # Any character, any number of times, as few as possible
)        # End of capturing group 1
(?:      # Start a non-capturing group that matches...
 END     # "END"
|        # or
 DONE    # "DONE"
|        # or
 NOMORE  # "NOMORE"
)        # End of non-capturing group
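The commented layout above can also be kept in the code itself by compiling the pattern with re.VERBOSE, which ignores whitespace and allows # comments inside the pattern. This is just a sketch; the name PATTERN is made up, and the test string is the second example from the question.

```python
import re

PATTERN = re.compile(
    r"""
    START      # match the start marker
    (          # capture in group 1:
      .*?      #   any character, as few times as possible
    )          # end of capturing group 1
    (?:        # non-capturing group for the end markers:
      END | DONE | NOMORE
    )
    """,
    re.VERBOSE,
)

# With one capturing group, findall returns only what that group matched.
print(PATTERN.findall("STARTFFEFFDONEFEWFSTARTFEFFENDDDW"))
# ['FFEFF', 'FEFF']
```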

And if your real goal is to match gene sequences, you need to make sure that you always match triplets:

re.findall("ATG(?:.{3})*?(?:TA[AG]|TGA)", s)
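As a rough sketch of how that could reproduce the sample run from the question: the version below adds a capturing group around the triplets so that only the part between ATG and the stop codon is returned. The names ORF and find_genes are made up for illustration, and it assumes matches are non-overlapping and that the "No string found" message is printed when nothing matches.

```python
import re

# ATG, then whole triplets (as few as possible), then a stop codon.
# Only the outer parentheses capture; (?:.{3}) is non-capturing.
ORF = re.compile(r"ATG((?:.{3})*?)(?:TA[AG]|TGA)")

def find_genes(dna):
    """Return the sequences found between ATG and the next in-frame stop codon."""
    return ORF.findall(dna)

for dna in ("TTATGTTTTAAGGATGGGGCGTTAGTT", "TGTGTGTATAT"):
    genes = find_genes(dna)
    print("\n".join(genes) if genes else "No string found")
# TTT
# GGGCGT
# No string found
```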

Upvotes: 5

Tanveer Alam

Reputation: 5275

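Not that efficient, but it works. Note that index() raises ValueError if a marker is missing.

>>> s = "FEFEWFSTARTFFFPENDDCDC"
>>> start = s.index('START')
>>> end = s.index('END', start + len('START')) + len('END')  # look for END only after START
>>> s[start:end]
'STARTFFFPEND'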

Upvotes: 1

Giorgio Ruffa

Reputation: 466

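yourString = 'FEFEWFSTARTFFFPENDDCDC'
start = yourString.find("START") + len("START")
# Search for "END" only after "START", so an earlier "END" is ignored
substring = yourString[start : yourString.find("END", start)]  # 'FFFP'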

Upvotes: 1

Bartosz Marcinkowski

Reputation: 6861

The simple way (no loop, no regex):

s = "FEFEWFSTARTFFFPENDDCDC"
tmp = s[s.find("START") + len("START"):]
result = tmp[:tmp.find("END")]
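One caveat with find() is that it returns -1 when a marker is missing, and slicing with -1 silently gives a wrong result instead of an error. A possible variant, still without loops or regex, uses str.partition, which reports whether the separator was found. The helper name between is hypothetical, and returning None for "not found" is just one choice.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition returns (before, separator, after); separator is '' if not found
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    return middle if found_end else None

print(between("FEFEWFSTARTFFFPENDDCDC", "START", "END"))  # FFFP
print(between("TGTGTGTATAT", "START", "END"))             # None
```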

Upvotes: 1

Amit

Reputation: 20456

>>> a = "FEFEWFSTARTFFFPENDDCDC"
>>> a[a.find('START'):]
'STARTFFFPENDDCDC'

Upvotes: 1
