Ali Gonabadi

Reputation: 944

Regex for removing complex html tags

I'm using a regex to extract the text of HTML pages. I strip the HTML tags with this regex:

<[^>]+>

The problem is that this regex does not work correctly on HTML tags like this one:

<input type="button" onclick="if (a > b) do_somthing();">

The regex matches only <input type="button" onclick="if (a > (up to the first > character), so b) do_somthing();"> remains in the output.
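The behavior can be reproduced with a few lines of Python. This is only a minimal sketch: the original context appears to be C#, and the name TAG_RE is a hypothetical helper, not something from the post.

```python
import re

# The regex from the question: "<", then one or more non-">" characters, then ">".
TAG_RE = re.compile(r"<[^>]+>")

html = '<input type="button" onclick="if (a > b) do_somthing();">'

# The match stops at the first ">", which sits inside the onclick attribute,
# so the tail of the tag is left behind.
leftover = TAG_RE.sub("", html)
print(repr(leftover))  # ' b) do_somthing();">'
```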

Which regex should I use to match this kind of markup?

Upvotes: 0

Views: 106

Answers (3)

Alan Gómez

Reputation: 378

You could try this (Vim substitute syntax):

:%s/<.\{-}[^ ]>

[^ ]> ensures that the match ends at a > that is not preceded by a space.
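For readers outside Vim, here is a rough Python sketch of the same idea. It assumes the Vim pattern <.\{-}[^ ]> translates to <.*?[^ ]> (\{-} being Vim's non-greedy repeat); the variable names are hypothetical. It also shows a limitation: the pattern relies on the space before the inner >, so it breaks when that space is missing.

```python
import re

# Assumed Python equivalent of the Vim pattern <.\{-}[^ ]>
pattern = re.compile(r"<.*?[^ ]>")

# Tag from the question: the inner ">" is preceded by a space, so it is skipped.
with_space = '<input type="button" onclick="if (a > b) do_somthing();">'
result_with_space = pattern.sub("", with_space)
print(repr(result_with_space))  # ''

# Same tag written as "a>b": the inner ">" now follows a non-space character,
# so the match ends too early and part of the tag is left behind.
without_space = '<input type="button" onclick="if (a>b) do_somthing();">'
result_without_space = pattern.sub("", without_space)
print(repr(result_without_space))  # 'b) do_somthing();">'
```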

Upvotes: 0

Moerwald

Reputation: 11264

As noted above, read the following link on why regexes don't work on HTML: Don't use regex for HTML.

As suggested in the comments, use a C# HTML parser such as CsQuery.

Upvotes: 1

M.S.

Reputation: 4423

The better and proper way to achieve this is to use an HTML parser (such as the HTML Agility Pack) to parse your HTML and then use the result according to your requirements. Parsing HTML with regex is hard and error-prone.

Read more: http://www.mikesdotnetting.com/article/273/using-the-htmlagilitypack-to-parse-html-in-asp-net
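To illustrate the parser-based approach without depending on a specific C# library, here is a minimal sketch using Python's standard html.parser module. It is not the HTML Agility Pack API; the class TextExtractor and the function extract_text are hypothetical names chosen for this example. The parser tokenizes quoted attribute values, so a > inside onclick does not end the tag.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; attribute values never arrive here.
        self.parts.append(data)


def extract_text(html):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = '<p>Click <input type="button" onclick="if (a > b) do_somthing();">here</p>'
print(extract_text(html))  # Click here
```

Note that this simple sketch also returns the contents of script and style elements as text; a real extractor would probably want to skip those.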

Upvotes: 1
