Nebu

Reputation: 1793

How to capture the actual html tag content using regex

Given the following example code:

bla bla 
<div class="a">
    <div class="b">beta</div> 
    bla bla bla 
    <div class="c">charlie</div> 
    <b>bold</b> 
    etc ... 
</div>

How do I extract the content of the tag <div class="a">? Please note that an unknown number of similar tags may be nested inside the parent tag. A simple regex like:

<div class="a">(.*?)</div> 

does not work because it will return:

<div class="b">beta

instead of the actual contents of the tag.

The regex would somehow have to count the opening and closing div tags to determine where to stop. I am not sure this is even possible in regex, hence my question.

Update: My question is not about how to extract a tag's data by regex in general. My question is how to make sure the full contents of the tag are extracted (as an HTML parser would do).

Upvotes: 1

Views: 337

Answers (1)

Dan Roberts

Reputation: 4694

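It is not possible to fully parse HTML with a plain regular expression. Matching arbitrarily nested tags requires counting, which a normal regex cannot do without engine-specific extensions such as recursion or balancing groups.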

Using regular expressions to parse HTML: why not?
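To illustrate the "counting" the question asks about: the regex can be used only to find individual div tags, while ordinary code keeps track of the nesting depth. The following is a minimal sketch in Python (the language is an assumption, as the question does not name one). The function name `extract_div_content` is hypothetical, and the sketch assumes well-formed markup with no div tags hidden inside comments, scripts, or attribute values.

```python
import re

# Matches any opening or closing <div> tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def extract_div_content(html, opening_tag):
    """Return the inner HTML of the first element that starts with opening_tag.

    Returns None if the opening tag is missing or the divs are unbalanced.
    """
    start = html.find(opening_tag)
    if start == -1:
        return None
    content_start = start + len(opening_tag)
    depth = 1  # we are now inside the outer <div>
    # Scan only the text after the outer opening tag.
    for match in DIV_TAG.finditer(html, content_start):
        if match.group(1):   # closing </div>
            depth -= 1
        else:                # nested opening <div ...>
            depth += 1
        if depth == 0:
            # This closing tag balances the outer opening tag.
            return html[content_start:match.start()]
    return None


sample = (
    'bla bla\n'
    '<div class="a">\n'
    '    <div class="b">beta</div>\n'
    '    bla bla bla\n'
    '    <div class="c">charlie</div>\n'
    '    <b>bold</b>\n'
    '    etc ...\n'
    '</div>\n'
)
print(extract_div_content(sample, '<div class="a">'))
```

This only works because the loop carries the depth counter between matches; a single plain pattern has no way to carry that state.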

With that said, you could parse the HTML yourself or use a library such as jsoup.

https://www.bennadel.com/blog/2358-parsing-traversing-and-mutating-html-with-coldfusion-and-jsoup.htm
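For comparison, here is a rough sketch of the parser-based route using Python's standard-library `html.parser` module rather than jsoup (a swapped-in tool, chosen only so the example needs no third-party dependency). The class name `DivContentExtractor` and the helper `inner_html` are hypothetical. The sketch rebuilds the inner markup from parser events, so comments are dropped and closing tags are re-emitted in lowercase.

```python
from html.parser import HTMLParser


class DivContentExtractor(HTMLParser):
    """Collect the inner HTML of the first <div> carrying a given class."""

    def __init__(self, target_class):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.target_class = target_class
        self.depth = 0      # nesting depth of <div> inside the target
        self.done = False   # set once the target's closing tag is seen
        self.parts = []

    def _inside(self):
        return self.depth > 0 and not self.done

    def handle_starttag(self, tag, attrs):
        if self._inside():
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif not self.done and tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self._inside():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._inside():
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._inside():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._inside():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._inside():
            self.parts.append(f"&#{name};")


def inner_html(html, target_class):
    parser = DivContentExtractor(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else None


page = (
    'bla bla\n<div class="a">\n    <div class="b">beta</div>\n'
    '    bla bla bla\n    <div class="c">charlie</div>\n'
    '    <b>bold</b>\n    etc ...\n</div>\n'
)
print(inner_html(page, "a"))
```

A dedicated library such as jsoup does the same job with a selector and handles malformed markup far more gracefully than this hand-rolled collector.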

Upvotes: 1