Using regular expression to parse limited HTML/XML but with embedded tags

Question

I have the following line in HTML/XML:

<p>some text here</p>

And I use the following regex to capture the content within the 'p' tags:

<p[^>]*>([^<]*)</p>

It worked until today, when the HTML/XML started to include embedded 'i' and 'b' tags within the 'p' tags, as in this sample:

<p>some <b>bold</b> and <i>italic</i> text here</p>

How can I modify the regex to capture the content within the 'p' tags in this last sample, including the 'b' and 'i' tags?

Miguel Mateo · Accepted Answer

To summarize, since there was a lot of heat of the kind "this should not be done using regex", here is the solution. Original XML:

<p>some text here</p>

Original regex to solve it:

<p[^>]*>([^<]*)</p>

Please note the negated character class [^<]: it matches any character except an open angle bracket, so the regex fails when the XML changes to:

<p>some <b>bold</b> and <i>italic</i> text here</p>

Hence the solution regex is:

<p[^>]*>(.*?)<\/p>

Please note that the negated character class [^<] is gone and that a ? now follows the * quantifier; that is the main difference. Replacing [^<] with . lets the match run across open angle brackets, so the embedded tags are captured. The ? makes the quantifier lazy instead of greedy, so the match stops at the first closing </p> instead of running on to the last one on the line.
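To see the difference in practice, here is a minimal sketch in Python. It assumes the samples above were wrapped in plain 'p' tags; the names STRICT, LAZY and paragraph_contents are made up for this illustration and are not part of the original post.

```python
import re

# Original pattern: [^<]* refuses any '<' inside the paragraph body.
STRICT = re.compile(r"<p[^>]*>([^<]*)</p>")
# Revised pattern: .*? lazily matches up to the first closing </p>.
LAZY = re.compile(r"<p[^>]*>(.*?)</p>")


def paragraph_contents(pattern, markup):
    """Return the captured body of every match of pattern in markup."""
    return pattern.findall(markup)


plain = "<p>some text here</p>"
nested = "<p>some <b>bold</b> and <i>italic</i> text here</p>"

print(paragraph_contents(STRICT, plain))   # body of the plain paragraph
print(paragraph_contents(STRICT, nested))  # nothing: the body contains '<'
print(paragraph_contents(LAZY, nested))    # body including the b and i tags
```

Keep in mind that . does not match a newline unless re.DOTALL is passed, so this sketch only covers paragraphs that sit on a single line, as in the samples above.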

Awesome, no? And people keep fighting to use XML parsers for something so simple and super fast!
