Miguel Mateo

Reputation: 309

Using a regular expression to parse limited HTML/XML with embedded tags

I have the following line in HTML/XML:

<p class="myText" style="...">some text here</p>

And I use the following regex to capture the content within the 'p' tags:

<p\sclass=\"myText\"[^>]*>([^<]*)</p>

It worked until today, when the HTML/XML started to include 'i' and 'b' tags embedded within the 'p' tags, as in this sample:

<p class="myText" style="...">some <b style="...">bold</b> and <i>italic</i> text here</p>

How can I modify the regex to capture the content within the 'p' tags in this last sample, including the 'b' and 'i' tags?

Upvotes: 0

Views: 227

Answers (2)

Miguel Mateo

Reputation: 309

To summarize, since there has been a lot of heat along the lines of "this should not be done using regex", here is the solution. Original XML:

<p class="myText" style="...">some text here</p>

Original regex to solve it:

<p\sclass=\"myText\"[^>]*>([^<]*)</p>

Note the negated character class [^<]*: it cannot match a '<', so the pattern fails when the XML changes to:

<p class="myText" style="...">some <b style="...">bold</b> and <i>italic</i> text here</p>

Hence the solution regex is:

<p\sclass=\"myText\".+?>(.*?)<\/p>

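The main difference is that the negated character classes ([^>]* and [^<]*) are gone, replaced by lazy quantifiers. Inside a character class, ^ means "any character except", so [^<]* stopped at the first opening angle bracket, including the one that starts a nested tag. Without that restriction, . matches any character (except a newline, by default), and the ? placed after + or * makes the quantifier lazy instead of greedy: .+? consumes as little as possible before a closing angle bracket, and (.*?) stops at the first </p> it finds.

Awesome, no? And people keep fighting to put XML parsers for something so simple and super fast!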
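
As a quick check of the difference described above, here is a minimal sketch in Python (the question does not name a language, so Python's re module is an assumption). The helper content() and the variable names are hypothetical; the two sample strings and both patterns come from this post.

```python
import re

# Sample inputs taken from the question.
plain = '<p class="myText" style="...">some text here</p>'
nested = ('<p class="myText" style="...">some <b style="...">bold</b>'
          ' and <i>italic</i> text here</p>')

# Original pattern: [^<]* cannot cross the "<" of a nested tag.
original = re.compile(r'<p\sclass="myText"[^>]*>([^<]*)</p>')
# Revised pattern from this answer: lazy quantifiers replace the negated classes.
revised = re.compile(r'<p\sclass="myText".+?>(.*?)</p>')


def content(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(content(original, plain))   # matches the simple case
print(content(original, nested))  # None: blocked by the nested <b> tag
print(content(revised, nested))   # inner content, nested tags included

# Caveat: .+? needs at least one character between "myText" and ">",
# so a tag with no further attributes is not matched by the revised pattern.
bare = '<p class="myText">some text here</p>'
print(content(revised, bare))     # None
```

The last case suggests writing [^>]*> instead of .+?> for the opening tag if the extra attributes are optional; the lazy (.*?) capture can stay as it is.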

Upvotes: 1

yms

Reputation: 10418

Use a lazy quantifier so that the opening tag ends at the first '>' in your string:

<p.+?>(.*)<\/p>

Test it here: https://regex101.com/r/Lz7GT0/1

If you want to process more than one match inside the same string, all you need to do is use a stateful matcher and call match repeatedly.
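
The linked fiddle is not reproduced here, so the following is only a sketch of the same idea in Python, assuming re.finditer as the equivalent of calling match repeatedly. The input string is hypothetical, and the second pattern swaps (.*) for a lazy (.*?), which is a change to the pattern as posted: with the greedy capture, two paragraphs on the same line are merged into a single match.

```python
import re

# Hypothetical input: two paragraphs on one line (not from the original post).
html = '<p class="a">first <b>one</b></p><p class="b">second <i>one</i></p>'

greedy = re.compile(r'<p.+?>(.*)</p>')   # pattern as posted in this answer
lazy = re.compile(r'<p.+?>(.*?)</p>')    # assumption: capture made lazy

# finditer walks the string and yields each non-overlapping match in turn.
greedy_hits = [m.group(1) for m in greedy.finditer(html)]
lazy_hits = [m.group(1) for m in lazy.finditer(html)]

print(len(greedy_hits))  # 1: the greedy capture runs to the last </p>
for hit in lazy_hits:    # one entry per paragraph
    print(hit)
```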

Try it out here: http://jsfiddle.net/jarn851m/

Upvotes: 3
