Reputation: 4382
I am using Python's re module to match some text.
import re

# line holds the text to search
pattern1 = re.compile(r'<a>(.*?)</a>[\s\S]*?<b>(.*?)</b>')
pattern2 = re.compile(r'<b>(.*?)</b>[\s\S]*?<a>(.*?)</a>')
items = pattern1.findall(line)
if items:
    print(items[0])
else:
    # fall back to the reverse tag order
    items = pattern2.findall(line)
    if items:
        print(items[0])
As you can see, the order of the a and b tags is not fixed (a can come before or after b).
I use two patterns (pattern 1 first, then pattern 2) to find the text inside tag a and tag b, but it looks ugly. I do not know how to get the same result as the code above with a single pattern.
Thanks!
Upvotes: 0
Views: 682
Reputation: 101052
Please use an HTML parser instead, as Tomalak and Maroun Maroun already suggested; Tomalak has explained why.
I'll just provide a literal solution to your problem for fun:
To combine two patterns, just use the alternation operator |, like:
pattern = re.compile(r'<a>(.*?)</a>[\s\S]*?<b>(.*?)</b>|<b>(.*?)</b>[\s\S]*?<a>(.*?)</a>')
But now you capture 4 groups, so you have to manually check which groups you matched.
match = re.search(pattern, line)
if match:  # re.search returns None when nothing matches
    if match.group(1, 2) != (None, None):
        print(match.group(1, 2))
    else:
        print(match.group(3, 4))
Or, simpler, using a named group:
pattern = re.compile(r'<a>(?P<first>.*?)</a>[\s\S]*?<b>(.*?)</b>|<b>(.*?)</b>[\s\S]*?<a>(.*?)</a>')
match = re.search(pattern, line)
if match:
    # compare with None so that an empty <a></a> still selects the first branch
    print(match.group(1, 2) if match.group('first') is not None else match.group(3, 4))
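The group bookkeeping above can also be hidden behind a small function. The following is only a sketch: the helper name find_a_and_b and the group names a1, b1, a2, b2 are made up for illustration, and unlike the question's code it always returns the texts as (a, b) rather than in document order.

```python
import re

# One alternation, with every group named so that no index counting is needed.
PATTERN = re.compile(
    r'<a>(?P<a1>.*?)</a>[\s\S]*?<b>(?P<b1>.*?)</b>'
    r'|<b>(?P<b2>.*?)</b>[\s\S]*?<a>(?P<a2>.*?)</a>'
)

def find_a_and_b(line):
    """Hypothetical helper: return (a_text, b_text) or None if nothing matches."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Groups of the branch that did not take part in the match are None.
    if match.group('a1') is not None:
        return match.group('a1'), match.group('b1')
    return match.group('a2'), match.group('b2')
```

Calling find_a_and_b('<b>y</b> text <a>x</a>') and find_a_and_b('<a>x</a> text <b>y</b>') gives the same tuple, so the caller does not need to know which order occurred.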
Upvotes: 0
Reputation: 338158
Please don't use regular expressions to parse HTML. Regular expressions can't deal with HTML (*). There is more than one nice HTML parser for Python; use one of them.
The following example uses pyquery, a jQuery API implementation on top of lxml.
from pyquery import PyQuery as pq
html_doc = """
<body>
<a>A first</a><b>B second</b>
<p>Other stuff here</p>
<b>B first</b><a>A second</a>
</body>
"""
doc = pq(html_doc)
for item in doc("a + b, b + a").prev():
    print(item.text)
Output:
A first
B first
Explanation: The selector a + b selects all <b> directly preceded by an <a>. .prev() moves to the immediately previous element, i.e. the <a> (which you seem to be interested in, but only when a <b> follows it). b + a does the same thing for the reverse element order.
(*) For one, regular expressions cannot handle indefinitely nested constructs. They also have problems when match order is not predictable, and they have no way of handling the semantic implications of HTML (character escape sequences, optionally and implicitly closed elements, lenient parsing of input that is not strictly valid, and more). They tend to break silently when the input is in a form that you did not anticipate. And, when thrown at HTML, they tend to get so complex that they make anybody's head hurt. Don't invest your time in writing ever more sophisticated regular expressions to parse HTML; it's a losing battle. The best state you can end up in is something that kind of works but is still inferior to a parser. Invest your time in learning a parser.
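To make the "break silently" point concrete, here is a small sketch with two made-up inputs (the variable names are only for illustration). Neither call raises an error; both simply return something other than what a parser would give you.

```python
import re

TAG_B = re.compile(r'<b>(.*?)</b>')

# Nested elements: the non-greedy group stops at the first </b>,
# so the capture contains a stray opening tag and " tail" is dropped.
nested = '<b>outer <b>inner</b> tail</b>'
nested_result = TAG_B.findall(nested)

# Character references: the regex returns the raw markup,
# it does not decode &lt; to "<" the way a parser would.
escaped = '<b>1 &lt; 2</b>'
escaped_result = TAG_B.findall(escaped)

print(nested_result)   # ['outer <b>inner']
print(escaped_result)  # ['1 &lt; 2']
```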
Upvotes: 3
Reputation: 95958
Change it to:
re.compile(r'(?:<b>|<a>)(.*?)(?:</a>|</b>)[\s\S]*?(?:<a>|<b>)(.*?)(?:</a>|</b>)')
Note that this needs more attention, as it also matches <a> followed by </b>. To prevent this, capture the tag name in the opening tag, for example <([ab])>, and then force the closing tag to use the same name with a backreference:
</\1>
This matches </ followed by the previously captured tag name, which will be a or b, and then >.
I don't recommend using regex to parse HTML, use a parser instead.
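For completeness, the backreference idea can be sketched as a full pattern. This is an illustration under assumptions rather than a tested solution for real HTML: the name find_pair is hypothetical, and the negative lookahead (?!\1>) is an addition that keeps the second tag from being the same kind as the first.

```python
import re

PAIR = re.compile(
    r'<([ab])>(.*?)</\1>'          # first tag; \1 forces the matching closing tag
    r'[\s\S]*?'                    # anything in between, including newlines
    r'<(?!\1>)([ab])>(.*?)</\3>'   # second tag must differ from the first; \3 closes it
)

def find_pair(line):
    """Hypothetical helper: return [(tag, text), (tag, text)] in document order."""
    match = PAIR.search(line)
    if match is None:
        return None
    first_tag, first_text, second_tag, second_text = match.groups()
    return [(first_tag, first_text), (second_tag, second_text)]
```

A mismatched pair such as '<a>x</b>' is no longer accepted, and the returned tag names tell you which order occurred.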
Upvotes: 1