Mike

How would I use regex to parse HTML to plain text?

How would I use regex to parse the following:

<b>HelloWorld</b>
<p>This is a test</p>
<a href="myUrl">Google</a>

All HTML tags need to be removed and the URLs extracted from hyperlink tags. The result should be:

HelloWorld
This is a test
myUrl

Upvotes: 2

Views: 395

Answers (2)

Tamas Czinege

Reputation: 121314

I know that's not the answer you expect, but you shouldn't try parsing HTML with regular expressions. HTML is way too complicated to be parsed by regexes, and all sorts of things can go wrong. It is very hard to write a regex that parses HTML reliably; I'm not even sure it's possible.

Use something like Beautiful Soup (for Python) or the HTML Agility Pack (for .NET). Or you can create your own parser with a parser generator.
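
As a rough sketch of the parser approach, here is how the example from the question could be handled with Python's standard-library html.parser module. That module is used only to keep the sketch dependency-free; it is not one of the libraries named above. The names TextAndLinkExtractor and html_to_text are made up for illustration, and dropping the link text in favor of the href is an assumption based on the expected output in the question.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Hypothetical helper: collects text outside <a> elements and each <a> href."""

    def __init__(self):
        super().__init__()
        self.lines = []          # output lines, in document order
        self._anchor_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                # Assumption: the URL replaces the link text in the output.
                self.lines.append(href)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only chunks and the visible text of links.
        if text and self._anchor_depth == 0:
            self.lines.append(text)


def html_to_text(html):
    """Return the plain text and link URLs of `html`, one item per line."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


sample = '<b>HelloWorld</b>\n<p>This is a test</p>\n<a href="myUrl">Google</a>'
print(html_to_text(sample))
```

For the three lines in the question this should print HelloWorld, This is a test, and myUrl, each on its own line. A dedicated library such as the ones mentioned above will generally cope better with malformed real-world markup than this sketch does.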

Upvotes: 8

Blixt

Reputation: 50169

You should use a parser for this. Regexes just won't do. You could use recursive regex patterns, but I don't think they're supported by the .NET regex engine.

Upvotes: 1
