Reputation: 944
I'm using a regex to extract the text of HTML pages. I strip the HTML tags with this regex:
<[^>]+>
The problem is that this regex doesn't work correctly on HTML tags like this one, where an attribute value contains a > character:
<input type="button" onclick="if (a > b) do_somthing();">
The regex matches only <input type="button" onclick="if (a >
and the rest of the tag, b) do_somthing();">
is left behind in the output.
Which regex should I use to match this kind of markup?
Upvotes: 0
Views: 106
Reputation: 378
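You could try this (Vim substitute syntax, where \{-} is the non-greedy quantifier; the .NET equivalent of the pattern is <.*?[^ ]>):
:%s/<.\{-}[^ ]>//g
[^ ]>
ensures that the match ends at a >
that is not preceded by a space. Note that this still breaks when the > inside the attribute has no space before it, as in a>b.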
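If you want to stay with regex, a common alternative is to make the pattern quote-aware, so that a > inside a quoted attribute value does not end the tag. The sketch below uses Python only because it is easy to run; the pattern itself should carry over to .NET unchanged. The names NAIVE_TAG, QUOTE_AWARE_TAG and strip_tags are made up for this example, and it assumes attribute quotes are balanced. It does not handle comments, scripts or CDATA.

```python
import re

# Pattern from the question: stops at the first '>' it sees.
NAIVE_TAG = re.compile(r"<[^>]+>")

# Quote-aware pattern: inside a tag, consume a double-quoted string,
# a single-quoted string, or any single character that is neither
# a quote nor '>'. The tag ends at the first '>' outside quotes.
QUOTE_AWARE_TAG = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^'">])+>""")


def strip_tags(html, pattern=QUOTE_AWARE_TAG):
    """Remove everything the given tag pattern matches."""
    return pattern.sub("", html)


sample = '<p>Hello</p><input type="button" onclick="if (a > b) do_somthing();"> world'

print(strip_tags(sample, NAIVE_TAG))  # Hello b) do_somthing();"> world
print(strip_tags(sample))             # Hello world
```

This fixes the example from the question, but it is still a regex over HTML and will trip on malformed markup.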
Upvotes: 0
Reputation: 11264
As noted above, regex is a poor fit for HTML; the linked post "Don't use regex for HTML" explains why.
As suggested in the comments, use a C# HTML parser such as CsQuery.
Upvotes: 1
Reputation: 4423
The proper way to achieve this is to use an HTML parser (such as Html Agility Pack) to parse your HTML and then extract what you need from the parsed document. Parsing HTML with regex is hard and error-prone.
Read more: http://www.mikesdotnetting.com/article/273/using-the-htmlagilitypack-to-parse-html-in-asp-net
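To show what the parser approach buys you, here is a minimal sketch using Python's standard-library html.parser, chosen only because it runs without extra packages; in C# a parser library such as the ones named in this thread would play the same role. TextExtractor and html_to_text are hypothetical names for this example. The parser works out where each tag ends, so the > inside the onclick attribute is not a problem, and script and style bodies are skipped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; ignores tags, attributes, scripts and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return parser.text()


sample = '<p>Hello</p><input type="button" onclick="if (a > b) do_somthing();"> world'
print(html_to_text(sample))  # Hello world
```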
Upvotes: 1