How to capture the actual html tag content using regex

Question

Given the following example code:

bla bla 

    beta 
    bla bla bla 
    charlie 
    bold 
    etc ...

How do I extract the content of the parent div tag? Please note that an unknown number of similar tags are nested inside the parent tag. A simple regex like:

(.*?)

does not work because it will return:

beta

instead of the actual contents of the tag.

The regex should somehow count the number of opening and closing div tags to determine where to stop. I am not sure this is even possible in regex, hence my question.

Update: My question is not about how to extract a tag's data by regex in general. My question is how to make sure the tag's entire contents are extracted (as an HTML parser would do).

Dan Roberts · Accepted Answer

It is not possible to fully parse HTML with plain regular expressions, because arbitrarily nested tags cannot be matched without extensions such as recursion.

Using regular expressions to parse HTML: why not?
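As a rough illustration of the "counting" idea from the question, the sketch below uses a regex only to find div tags and keeps the nesting depth in ordinary code. It is written in Python for brevity; the sample markup, the id "outer", and the function name inner_div_content are made up for this example, since the original markup is not shown here. It ignores comments, script blocks, and attribute values containing ">", so treat it as a demonstration rather than a robust parser.

```python
import re

# Matches an opening <div ...> or a closing </div>.
# Group 1 is "/" for a closing tag and "" for an opening tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def inner_div_content(html, start=0):
    """Return the markup inside the first <div> at or after `start`, or None."""
    depth = 0
    content_start = None
    for m in DIV_TAG.finditer(html, start):
        if m.group(1) == "":          # opening tag
            if depth == 0:
                content_start = m.end()
            depth += 1
        elif depth > 0:               # closing tag that belongs to an open div
            depth -= 1
            if depth == 0:
                return html[content_start:m.start()]
    return None                       # no div found, or the div was never closed


# Hypothetical sample; the real markup from the question is not available.
sample = 'x <div id="outer">a <div>beta</div> b <div>charlie</div> c</div> y'

# The lazy pattern stops at the first closing tag it meets.
naive = re.search(r"<div[^>]*>(.*?)</div>", sample, re.S).group(1)

print(naive)
print(inner_div_content(sample))
```

The lazy pattern returns only the text up to the first closing tag, while the counter keeps going until the depth returns to zero.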

With that said, you could parse the HTML yourself or use a parser library such as jsoup.

https://www.bennadel.com/blog/2358-parsing-traversing-and-mutating-html-with-coldfusion-and-jsoup.htm
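For comparison, here is a minimal sketch of the "use a parser" route. It uses Python's standard html.parser module rather than jsoup, purely so the example has no external dependencies. The class name DivContentExtractor, the helper extract_div, and the sample markup with id "outer" are assumptions made for illustration. Note that with the default parser settings, character references in the text are decoded, so the result is a reconstruction of the inner markup, not a byte-for-byte copy.

```python
from html.parser import HTMLParser


class DivContentExtractor(HTMLParser):
    """Collects the markup inside the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # div nesting depth inside the target (0 = outside)
        self.parts = []     # pieces of the reconstructed inner markup
        self.done = False   # set once the target div has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())
            if tag == "div":
                self.depth += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the depth.
        if self.depth > 0 and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)


def extract_div(html, target_id):
    """Return the inner markup of the div with the given id, or None."""
    parser = DivContentExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else None


# Hypothetical sample markup, not taken from the question.
sample_html = ('<p>x</p><div id="outer">a <div>beta</div> b <b>bold</b><br/></div>'
               '<div>z</div>')
print(extract_div(sample_html, "outer"))
```

The parser takes care of tokenizing tags and attributes, so the only state left to track by hand is the div depth. A library such as jsoup goes a step further and builds a full document tree that can be queried directly.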
