marcusstarnes

Reputation: 6531

Remove Encoded HTML from Strings using RegEx

I currently have an extension method for removing any HTML from strings.

Regex.Replace(s, @"<(.|\n)*?>", string.Empty);

This works fine on the whole. However, I am occasionally passed strings that contain both standard HTML markup and encoded markup (I don't control the source data, so I can't correct things at the point of entry), e.g.

&lt;p&gt;<p>Sample text</p>&lt;/p&gt;

I need an expression that will remove both encoded and non-encoded HTML (whether it be paragraph tags, anchor tags, formatting tags etc.) from a string.

Upvotes: 2

Views: 4731

Answers (1)

Shekhar_Pro

Reputation: 18430

I think you can do that in two passes with the same extension method.

First remove the usual unencoded tags, then HTML-decode the returned string and run the replacement again. Simple.
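
A minimal sketch of that two-pass idea, assuming `System.Net.WebUtility.HtmlDecode` is available for the decoding step. The class and method names here (`HtmlStripExtensions`, `StripHtmlTwoPass`) are made up for illustration, since the question does not show the original method's signature; the pattern is the one from the question.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripExtensions
{
    // Same pattern as in the question: lazily matches everything between '<' and '>'.
    private static readonly Regex TagPattern = new Regex(@"<(.|\n)*?>");

    // Hypothetical method name, not taken from the question.
    public static string StripHtmlTwoPass(this string s)
    {
        if (string.IsNullOrEmpty(s))
        {
            return s;
        }

        // Pass 1: remove the literal (unencoded) tags.
        string withoutTags = TagPattern.Replace(s, string.Empty);

        // Turn encoded markup such as &lt;p&gt; back into real tags.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // Pass 2: remove the tags that decoding exposed.
        return TagPattern.Replace(decoded, string.Empty);
    }
}
```

With the sample from the question, `"&lt;p&gt;<p>Sample text</p>&lt;/p&gt;".StripHtmlTwoPass()` should give `Sample text`. One thing to watch: decoding also turns legitimately escaped text like `1 &lt; 2` into `1 < 2`, so if a `>` appears later in the string, the second pass may remove everything in between.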

Upvotes: 5
