Reputation:
How would I use regex to parse the following:
<b>HelloWorld</b>
<p>This is a test</p>
<a href="myUrl">Google</a>
All HTML tags need to be removed and the URLs extracted from the hyperlink tags, so the result should be:
HelloWorld This is a test myUrl
Upvotes: 2
Views: 395
Reputation: 121314
I know that's not the answer you expect, but you shouldn't try parsing HTML with regular expressions. HTML is way too complicated to be parsed by regexes, and all sorts of things can go wrong: nested tags, attribute values that contain angle brackets, comments, unclosed tags. It is very hard to write a regex that parses HTML reliably; I'm not even sure it's possible.
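To illustrate one way this goes wrong, here is a small sketch of a naive tag-stripping regex failing on valid HTML. The pattern and the sample string are my own examples, not something from the question:

```python
import re

# A common naive pattern: "anything between < and the next >".
NAIVE_TAG = re.compile(r"<[^>]+>")

# Valid HTML: a quoted attribute value may contain a '>' character.
html = '<a title="1 > 0" href="u">link</a>'

# The pattern ends the "tag" at the '>' inside the title attribute,
# so part of the tag leaks into the output.
stripped = NAIVE_TAG.sub("", html)
print(repr(stripped))  # ' 0" href="u">link'
```

You can patch the pattern for this case, but then comments, scripts, and malformed markup each need another patch.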
Use something like Beautiful Soup (for Python) or the HTML Agility Pack (for .NET). Or you can create your own parser with a parser generator.
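As a rough sketch of the parser approach, here is a version using Python's standard-library `html.parser` as a stand-in for the libraries above, so it runs without installing anything. The class name `TextAndLinkExtractor` and the helper `extract` are hypothetical names I made up, and I'm assuming that link text should be dropped in favor of the `href`, as in the expected output of the question:

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collects text outside of links, plus the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._anchor_depth = 0  # > 0 while we are inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                self.parts.append(href)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Assumption: link text is replaced by the URL, so skip it here.
        if text and not self._anchor_depth:
            self.parts.append(text)


def extract(html):
    """Return the text and link URLs of `html`, joined by single spaces."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = '<b>HelloWorld</b>\n<p>This is a test</p>\n<a href="myUrl">Google</a>'
print(extract(sample))  # HelloWorld This is a test myUrl
```

The same idea carries over to Beautiful Soup or the HTML Agility Pack: let the library build the document tree, then walk it and pick out text nodes and `href` attributes.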
Upvotes: 8
Reputation: 50169
You should use a parser for this; regexes just won't do. You could use recursive regex patterns, but the .NET regex engine doesn't support them (it offers balancing groups instead).
Upvotes: 1