Ernanni

Reputation: 53

Use RegEx to break down an HTML source

I'm trying to separate every single tag from an HTML source that I'm receiving.

It's a huge piece of code, and I'm trying to make it more readable for human analysis. So far I have only come up with this RegEx:

RegEx(<\w*>.*<\/\w*>)

But this matches from the beginning of the !DOCTYPE tag and runs all the way to the </html>.

What I'm trying to do is select each tag individually, regardless of its type.
Also, I'm running this RegEx with JavaScript.

Any suggestions are very welcome :)

Upvotes: 0

Views: 46

Answers (1)

Quentin

Reputation: 943563

Solving the immediate problem is trivial. You need to make your wildcards lazy instead of greedy.

That is, change * (match as much as possible of the preceding token) to *? (match as little as possible of the preceding token while still letting the rest of the pattern match).
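To make the greedy/lazy difference concrete, here is a minimal sketch. The sample string is hypothetical (it is not from the question), and the pattern is the one from the question with only the wildcard changed.

```javascript
// Hypothetical sample input, not taken from the original question.
const sampleHtml = "<p>one</p><p>two</p>";

// Greedy: .* consumes as much as it can, so the match runs to the LAST closing tag.
const greedyMatches = sampleHtml.match(/<\w*>.*<\/\w*>/g);

// Lazy: .*? stops at the FIRST closing tag that lets the pattern succeed.
const lazyMatches = sampleHtml.match(/<\w*>.*?<\/\w*>/g);

console.log(greedyMatches); // [ '<p>one</p><p>two</p>' ]
console.log(lazyMatches);   // [ '<p>one</p>', '<p>two</p>' ]
```

Note that . does not match line breaks by default in JavaScript, so on multi-line input the s (dotAll) flag would probably be needed as well.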

… but then your code will break if there is a > inside an attribute value, or a script element, or a style element, etc.
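A rough illustration of two such failures, using made-up inputs. The first pattern is the lazy version of the one in the question; the second, <[^>]*>, is a hypothetical tag-only pattern that does not appear in the question and is included only to show the attribute case.

```javascript
// Lazy version of the pattern from the question.
const lazyPairPattern = /<\w*>.*?<\/\w*>/g;

// A closing tag inside a script string ends the match too early.
const scriptSample = '<script>var s = "</p>";</script>';
const scriptMatches = scriptSample.match(lazyPairPattern);
console.log(scriptMatches); // [ '<script>var s = "</p>' ]

// Hypothetical tag-only pattern (an assumption, not from the question).
const tagOnlyPattern = /<[^>]*>/g;

// A > inside an attribute value cuts the opening tag short.
const attrSample = '<a title="1 > 0">link</a>';
const attrMatches = attrSample.match(tagOnlyPattern);
console.log(attrMatches); // [ '<a title="1 >', '</a>' ]
```

In both cases the regex has no notion of quoting or of script content, so it treats markup-looking characters inside them as real markup.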

Parsing HTML is not trivial. Regular expressions are not a good tool for it. Use an existing library instead.
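As one possible starting point, assuming the code runs in a browser, the built-in DOMParser can do the parsing. The helper name listElements and the sample input are hypothetical, and a Node.js setup would need a separate HTML parsing library instead.

```javascript
// Hypothetical helper: returns the tag name of every element in the parsed document.
// Assumes a browser environment where DOMParser is available.
function listElements(html) {
  const doc = new DOMParser().parseFromString(html, "text/html");
  return Array.from(doc.querySelectorAll("*"), (el) => el.tagName.toLowerCase());
}

// Guarded so the snippet does nothing outside a browser.
if (typeof DOMParser !== "undefined") {
  // The > inside the attribute value is handled by the parser.
  console.log(listElements('<p title="1 > 0">hi</p>')); // [ 'html', 'head', 'body', 'p' ]
}
```

The parser adds the html, head and body elements that the fragment omits, which is why they show up in the list.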

Upvotes: 2
