Patrick

Reputation: 5836

Strip all HTML tags except certain ones?

I have a requirement where I need to strip all tags out of a large block of HTML that is tag-soup, essentially stuff like:

<div style=""><span style=""><p style=""><div style=""><span style="">

etc.

I need to strip them all out, except the <p> tags, but on those I need to strip out the attributes such as style="" and just leave them as <p></p>.

I am currently stripping all tags with a regex:

public static string StripHtml(string input) => Regex.Replace(input, "<.*?>", string.Empty);

Any ideas on how to do this?

I would use a C# library for this, but I am using .NET Core on Linux, so a lot of these libraries (such as AngleSharp) that require the full framework aren't going to work for me.

Upvotes: 0

Views: 1060

Answers (1)

Tracer69

Reputation: 1110

<(?!/?p[\s/>])[^>]*> matches every tag except the paragraph tags (<p>, <p ...> and </p>), so your program can delete all matches of this regex. What remains are the paragraph tags; replace each opening one that still carries attributes with a bare <p>. The regex <p\s[^>]*> matches those.
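
A minimal sketch of this two-pass idea in C#, using only System.Text.RegularExpressions (available on .NET Core). The class and method names are hypothetical, and the two patterns are one possible choice: the first pass removes every tag that is not a paragraph tag, the second reduces any opening paragraph tag with attributes to a bare <p>. It assumes no attribute value contains a literal > character, which regex-based stripping cannot handle reliably.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class HtmlCleaner
{
    // Any <...> run that is NOT <p>, <p ...>, <p/> or </p>.
    // The lookahead sits right after '<' so that "</p>" is also skipped.
    private static readonly Regex NonParagraphTag =
        new Regex(@"<(?!/?p[\s/>])[^>]*>", RegexOptions.IgnoreCase);

    // An opening paragraph tag that carries attributes, e.g. <p style="">.
    private static readonly Regex ParagraphWithAttributes =
        new Regex(@"<p\s[^>]*>", RegexOptions.IgnoreCase);

    public static string StripAllButParagraphs(string input)
    {
        // Pass 1: drop every tag except paragraph tags.
        string withoutOtherTags = NonParagraphTag.Replace(input, string.Empty);

        // Pass 2: reduce attributed paragraph tags to a bare <p>.
        return ParagraphWithAttributes.Replace(withoutOtherTags, "<p>");
    }
}
```

Calling HtmlCleaner.StripAllButParagraphs on <div style=""><span style=""><p style="">Hello</p></span></div> should give <p>Hello</p>. Tags whose names merely start with "p", such as <pre> or <param>, are still removed, because the lookahead requires whitespace, / or > directly after the p.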

Upvotes: 1
