Reputation: 5836
I have a requirement where I need to strip all tags out of a large block of HTML that is tag-soup, essentially stuff like:
<div style=""><span style=""><p style=""><div style=""><span style="">
etc.
I need to strip them all out except the <p> tags, and on those I need to strip out the attributes such as style="" and just leave them as <p></p>.
I am currently stripping all tags with a regex:
public static string StripHtml(string input) => Regex.Replace(input, "<.*?>", string.Empty);
Any ideas on how to do this?
I would use a dedicated C# library for this, but I am using .NET Core on Linux, so a lot of these libraries (such as AngleSharp) that require the full framework aren't going to work for me.
Upvotes: 0
Views: 1060
Reputation: 1110
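<((?!p\s).)*?> will match all tags except opening paragraph tags that carry attributes. So your program could delete all matches of this regex and then replace the remaining opening p tags (matched by <p .*?>) with a bare <p>. Be aware that, as written, the first regex also matches a bare <p> and the closing </p>, and it skips any tag that contains a "p" followed by whitespace anywhere inside it, so those cases need separate handling. A stricter pattern such as <(?!/?p\b)[^>]*> excludes both <p> and </p> by tag name.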
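The two-pass idea can be sketched in C# roughly as follows. This is only a minimal sketch, not a full HTML parser: the class name HtmlCleaner and the method name StripAllButParagraphs are made up for illustration, and it uses the stricter lookahead pattern rather than the exact regex quoted above. It assumes no attribute value contains a literal ">" character.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative, not from any library.
public static class HtmlCleaner
{
    // Matches any opening or closing tag whose name is not "p".
    // The negative lookahead rejects "<p", "</p", "<p ..." but still
    // matches longer names such as "<pre>" or "<param>".
    private static readonly Regex NonParagraphTag =
        new Regex(@"<(?!/?p\b)[^>]*>", RegexOptions.IgnoreCase);

    // Matches an opening <p> tag together with any attributes it carries.
    private static readonly Regex ParagraphOpenTag =
        new Regex(@"<p\b[^>]*>", RegexOptions.IgnoreCase);

    public static string StripAllButParagraphs(string input)
    {
        // Pass 1: delete every tag that is not <p ...> or </p>.
        string withoutOtherTags = NonParagraphTag.Replace(input, string.Empty);

        // Pass 2: collapse every remaining opening p tag to a bare <p>.
        return ParagraphOpenTag.Replace(withoutOtherTags, "<p>");
    }
}
```

For example, calling HtmlCleaner.StripAllButParagraphs on <div style=""><span style=""><p style="color:red">Hello</p></span></div> should yield <p>Hello</p>. System.Text.RegularExpressions is part of .NET Core, so this should run on Linux without extra packages.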
Upvotes: 1