Reputation: 33
I'm using Python to try to pull data from this old code, and the content of interest is not between neat HTML tags but rather between strings of characters including punctuation and letters. Rather than getting each piece of content, though, I'm getting everything between the first instance of the opening string and the last instance of the closing string. For example:
>>> q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
>>> start1 = '"text:"'
>>> end1 = '",body'
>>> print q[q.find(start1)+len(start1):q.rfind(end1)]
content_of_interest_1",body, code code "text:":content_of_interest_2
I'm instead looking to get out each instance of content bounded by start1 and end1, i.e.:
content_of_interest_1, content_of_interest_2
How can I re-phrase my code to get each instance of string-bounded content rather than all bounded content as above?
Upvotes: 3
Views: 41
Reputation: 61225
You can use a regular expression with a positive lookbehind:
import re
re.findall(r'(?<="text:"):?\w+', q)
#['content_of_interest_1', ':content_of_interest_2']
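The pattern above is tied to this particular sample. As a rough sketch of the same idea for arbitrary delimiters, the helper below (the name between_all is made up for illustration) escapes both bounding strings with re.escape and captures the text between them non-greedily, so each match stops at the nearest closing string. It assumes the delimiters are meant literally and that matches do not overlap.

```python
import re

def between_all(text, start, end):
    # Hypothetical helper: escape the delimiters so punctuation in them
    # is matched literally rather than treated as regex syntax.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # (.*?) is non-greedy, so each match ends at the nearest `end`.
    # re.DOTALL lets the captured content span line breaks.
    return re.findall(pattern, text, flags=re.DOTALL)

q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
print(between_all(q, '"text:"', '",body'))
# ['content_of_interest_1', ':content_of_interest_2']
```

The leading colon in the second result comes from the sample string itself, which has an extra ":" after the second "text:".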
Upvotes: 1
Reputation: 107287
To get the first substring, use find instead of rfind for end1; to get the last one, use rfind for both start1 and end1:
>>> q[q.find(start1)+len(start1):q.find(end1)]
'content_of_interest_1'
>>> q[q.rfind(start1)+len(start1):q.rfind(end1)]
':content_of_interest_2'
But find only gives you the index of the first occurrence of the start and end strings, and rfind only the last. A more suitable approach for such tasks is a regular expression:
>>> import re
>>> re.findall(r':"(.*?)"',q)
['content_of_interest_1', ':content_of_interest_2']
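If you would rather stay with find, one possible approach is to call it in a loop and pass the position where the previous match ended as its second argument, so each search resumes from there. This is only a sketch: the function name find_all_between is invented here, and it assumes both delimiters are non-empty and that matches do not overlap.

```python
def find_all_between(text, start, end):
    # Hypothetical helper: collect every substring bounded by start and end.
    if not start or not end:
        raise ValueError("start and end must be non-empty")
    results = []
    pos = 0
    while True:
        i = text.find(start, pos)   # next opening delimiter at or after pos
        if i == -1:
            break
        i += len(start)             # content begins right after it
        j = text.find(end, i)       # nearest closing delimiter after the content start
        if j == -1:
            break
        results.append(text[i:j])
        pos = j + len(end)          # resume searching after this match
    return results

q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
print(find_all_between(q, '"text:"', '",body'))
# ['content_of_interest_1', ':content_of_interest_2']
```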
Upvotes: 1