How would I use regex to parse HTML to plain text?

Question

How would I use regex to parse the following:

HelloWorld
This is a test
Google

All HTML tags need to be removed and the URLs extracted from hyperlink tags. The result should be:

HelloWorld
This is a test
myUrl

Tamas Czinege · Accepted Answer

I know this isn't the answer you expect, but you shouldn't try to parse HTML with regular expressions. HTML is far too complicated to be parsed by regexes, and all sorts of things can go wrong: nested tags, unclosed tags, comments, and attribute values that contain angle brackets, to name a few. It is very hard to write a regex that parses HTML reliably, and I'm not even sure it's possible.

Use something like Beautiful Soup (for Python) or the HTML Agility Pack (for .NET). Or you can create your own parser with a parser generator.
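To illustrate the parser-based approach, here is a minimal sketch. It uses Python's standard-library html.parser module rather than Beautiful Soup, so it runs without installing anything. The sample markup is an assumption, since the question's original HTML is not shown; the names TextAndLinkExtractor and html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collects visible text, replacing each hyperlink's text with its href."""

    def __init__(self):
        super().__init__()
        self.parts = []        # output fragments, in document order
        self._in_link = False  # True while inside an <a href="..."> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Emit the URL and suppress the link's own text.
                self.parts.append(href)
                self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self._in_link:
            self.parts.append(text)


def html_to_text(html):
    """Return the text content of html, one fragment per line."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical input: the question's real markup is unknown.
sample = (
    "<b>HelloWorld</b><br />"
    "<p>This is a test</p>"
    '<a href="myUrl">Google</a>'
)
print(html_to_text(sample))
```

For this assumed input, the sketch prints the three lines the question asks for. Real-world pages (scripts, styles, entities, malformed markup) would need more handling, which is where a full library such as Beautiful Soup earns its keep.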
