Reputation: 43
I need to match and capture the information between the pairs of tags. There are 2 pairs of tags per line. A pair of tags is like this:
<a> </a> <b>hello hello 123</b> stuff to ignore here <i>123412bhje</i> <a>what???</a> stuff to ignore here <b>asd13asf</b> <i>who! Hooooo!</i> stuff to ignore here <i>df7887a</i>
The expected output is:
hello hello 123 123412bhje
what??? asd13asf
who! Hooooo! df7887a
I need to specifically use the format:
M = re.findall("", linein)
Upvotes: 0
Views: 77
Reputation: 4105
To ignore the first <a> </a> tag, the regex assumes that the first character inside a tag is not a space, although spaces are allowed after that.
Here are the other assumptions made:
Tag names consist only of lowercase letters, as in <b> </b> and <i> </i>.
The text inside the tags consists only of uppercase letters, lowercase letters, numbers, and the symbols ! and ?. If there are other symbols within the tags, then it may not match accurately.
Here is a working version based on your example:
import re
linein = '<a> </a> <b>hello hello 123</b> stuff to ignore here <i>123412bhje</i> <a>what???</a> stuff to ignore here <b>asd13asf</b> <i>who! Hooooo!</i> stuff to ignore here <i>df7887a</i>'
# Capture tag content: the first character must be a letter, digit, ? or !; whitespace is allowed after it
M = re.findall(r'<[a-z]+>([A-Za-z0-9?!][A-Za-z0-9?!\s]*)</[a-z]+>', linein)
# Print the captured values two per line
for i in range(0, len(M), 2):
    print(M[i], M[i+1])
OUTPUT:
hello hello 123 123412bhje
what??? asd13asf
who! Hooooo! df7887a
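If the text inside the tags can contain other symbols, a looser pattern may be easier to maintain. The following is only a sketch, not a tested replacement for every input: it assumes the content never contains a < character and never starts with whitespace (which is what skips the empty <a> </a> pair). The names PATTERN and pairs are introduced here for illustration.

```python
import re

linein = ('<a> </a> <b>hello hello 123</b> stuff to ignore here <i>123412bhje</i> '
          '<a>what???</a> stuff to ignore here <b>asd13asf</b> '
          '<i>who! Hooooo!</i> stuff to ignore here <i>df7887a</i>')

# Assumption: tag content starts with a non-whitespace character (\S)
# and runs up to the next '<', so any symbol other than '<' is accepted.
PATTERN = r'<[a-z]+>(\S[^<]*)</[a-z]+>'

M = re.findall(PATTERN, linein)

# Group consecutive matches into pairs (two captured tags per output line).
pairs = list(zip(M[0::2], M[1::2]))
for first, second in pairs:
    print(first, second)
```

This keeps the required M = re.findall(..., linein) form with a single capture group, so M is still a flat list of strings. Note that zip silently drops a trailing unpaired match, whereas indexing M[i+1] would raise an IndexError on an odd number of matches.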
Upvotes: 1