kill3rTcell

Reputation: 33

Searching for content between non-tag strings

I'm using Python to pull data out of some old code. The content of interest is not between neat HTML tags but between strings of characters that include punctuation and letters. Rather than getting each piece of content, though, I'm getting everything between the first instance of the initial string and the last instance of the final bounding string. For example:

>>> q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
>>> start1 = '"text:"'
>>> end1 = '",body'
>>> print q[q.find(start1)+len(start1):q.rfind(end1)]
content_of_interest_1",body, code code "text:":content_of_interest_2

I'm instead looking to get out each instance of content bounded by start1 and end1, i.e.:

content_of_interest_1, content_of_interest_2

How can I re-phrase my code to get each instance of string-bounded content rather than all bounded content as above?

Upvotes: 3

Views: 41

Answers (2)

Sede

Reputation: 61225

You can use a regular expression with a positive lookbehind:

import re
re.findall(r'(?<="text:"):?\w+', q)
#['content_of_interest_1', ':content_of_interest_2']
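
The lookbehind pattern above is tied to this particular input (it relies on `\w+` and an optional leading colon). A more general sketch, assuming both delimiters are plain literal strings, escapes them with `re.escape` and captures lazily between them. The helper name `extract_between` is hypothetical, not something from the question:

```python
import re

def extract_between(text, start, end):
    # re.escape makes the delimiters match literally, even if they
    # contain characters that are special in regex syntax.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # The lazy group (.*?) stops at the nearest closing delimiter,
    # so each bounded chunk is returned separately.
    return re.findall(pattern, text)

q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
print(extract_between(q, '"text:"', '",body'))
# ['content_of_interest_1', ':content_of_interest_2']
```

The leading colon in the second result comes from the sample string itself, which has an extra `:` after the second `"text:"`. If the content can span several lines, passing `re.DOTALL` to `re.findall` would be needed, since `.` does not match newlines by default.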

Upvotes: 1

Kasravnd

Reputation: 107287

For the first substring, look up end1 with q.find instead of q.rfind; for the last substring, use rfind for both start1 and end1:

>>> q[q.find(start1)+len(start1):q.find(end1)]
'content_of_interest_1'
>>> q[q.rfind(start1)+len(start1):q.rfind(end1)]
':content_of_interest_2'

However, find and rfind only return the index of the first or last occurrence, so this approach cannot reach any matches in between. A more appropriate way for such tasks is a regular expression with a non-greedy group:

>>> import re
>>> re.findall(r':"(.*?)"',q)
['content_of_interest_1', ':content_of_interest_2']
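
If you would rather stay with `str.find`, its optional second argument (the index to start searching from) lets you walk through the string and collect every bounded chunk. This is only a sketch under the assumption that the delimiters do not nest or overlap; `find_all_between` is a hypothetical helper name:

```python
def find_all_between(text, start, end):
    results = []
    pos = 0
    while True:
        # Look for the next opening delimiter at or after pos.
        i = text.find(start, pos)
        if i == -1:
            break
        i += len(start)
        # Look for the nearest closing delimiter after the opening one.
        j = text.find(end, i)
        if j == -1:
            break
        results.append(text[i:j])
        # Resume the search just past the closing delimiter.
        pos = j + len(end)
    return results

q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
print(find_all_between(q, '"text:"', '",body'))
# ['content_of_interest_1', ':content_of_interest_2']
```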

Upvotes: 1
