Reputation: 53
I'm trying to separate every single tag in an HTML source that I'm receiving.
It's a huge piece of code, and I'm trying to make it more readable for human analysis. So far I've only come up with this RegEx:
RegEx(<\w*>.*<\/\w*>)
But this matches from the beginning of the !DOCTYPE tag and runs all the way to the </html>.
And what I'm trying to do is to select each tag individually, independent of the type.
Also, I'm running this RegEx with JavaScript.
Any suggestions are very welcome :)
Upvotes: 0
Views: 46
Reputation: 943563
Solving the immediate problem is trivial. You need to make your wildcards lazy instead of greedy.
i.e. you want to change * (match as much as you can of the previous thing) to *? (match as little as you can of the previous thing while still letting the next thing match).
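To make the difference concrete, here is a small sketch comparing the greedy pattern from the question with its lazy variant. The sample string is invented for illustration and is far simpler than real HTML.

```javascript
// Sample input invented for illustration
const html = "<p>one</p><p>two</p>";

// Greedy: .* grabs as much as possible, so a single match spans everything
const greedy = html.match(/<\w*>.*<\/\w*>/g);

// Lazy: .*? stops at the first closing tag that lets the pattern succeed
const lazy = html.match(/<\w*>.*?<\/\w*>/g);

console.log(greedy); // [ '<p>one</p><p>two</p>' ]
console.log(lazy);   // [ '<p>one</p>', '<p>two</p>' ]
```

Note that this only works here because the sample tags have no attributes and no nesting.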
… but then your code will break if there is a > inside an attribute value, or a script element, or a style element, etc.
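As a quick illustration of that failure mode, the sketch below runs a naive "one tag" pattern over a made-up input whose attribute value contains a literal >. Both the pattern and the input are assumptions for demonstration, not something from the question.

```javascript
// Hypothetical input: the title attribute contains a literal ">"
const tricky = '<a title="x > y">link</a>';

// A naive "one tag" pattern: "<", then anything up to the first ">"
const tagPattern = /<[^>]*>/g;

const found = tricky.match(tagPattern);
console.log(found); // [ '<a title="x >', '</a>' ]
```

The opening tag is cut off in the middle of the attribute, because the pattern has no idea that the first > sits inside quotes.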
Parsing HTML is not trivial. Regular expressions are not a good tool for it. Use an existing library instead.
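Since you are in JavaScript, one option in a browser is the built-in DOMParser. The sketch below is only a rough outline: listTagNames is a hypothetical helper name, and it assumes a browser environment. Outside a browser (for example in Node) DOMParser is not defined, so you would need a separate HTML parsing library there.

```javascript
// Sketch only: listTagNames is a hypothetical helper, not part of any library.
function listTagNames(html) {
  // DOMParser is a browser API; outside a browser it is undefined
  if (typeof DOMParser === "undefined") {
    return null;
  }
  // Parse the string with the browser's own HTML parser
  const doc = new DOMParser().parseFromString(html, "text/html");
  // querySelectorAll("*") returns every element in document order
  return Array.from(doc.querySelectorAll("*"), (el) => el.tagName.toLowerCase());
}

// Same tricky input as a regex would choke on: ">" inside an attribute value
const names = listTagNames('<a title="x > y">link</a>');
console.log(names);
```

Once the source is parsed into a document, you can walk the elements one by one instead of trying to cut the raw text apart.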
Upvotes: 2