Reputation: 1398
How to clean HTML from any specific tag via Regex in C#?
Here is a sample HTML from which I need to delete <font size="-2">
R&usg=AFQjCNFYiDC6u3xOGn4JpO-GF83PjdSbtw&url=http://online.wsj.com/article/SB10000872396390444426404577647060576633348.html"><img src="//nt2.ggpht.com/news/tbn/bm6jvTMtF-PpnM/6.jpg" alt="" border="1" width="80" height="80" /><br /><font size="-2">Wall Street Journal</font></a></font>
</td>
I know we have to use Regex somehow, but I cannot figure out how to use it here.
I have tried to adjust this method, but it removes ALL tags.
public string Strip(string text)
{
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}
In fact, I am looking for an approach like this:
public string Strip(string text, HTMLTags tag)
{
}
where HTMLTags is an enum of some of the HTML tags:
enum HTMLTags
{
    Font,
    Div,
    Td
    ...
}
Thank you for any clue!!!
Upvotes: 0
Views: 711
Reputation: 35407
While HtmlAgilityPack is most probably the best option (it lets you run LINQ and/or XPath queries against a DOM-like representation of the HTML), a start could be the following:
public static class HTMLTags
{
    // Members of a static class must themselves be static.
    public static string Font { get { return "<font>"; } }
    public static string Div { get { return "<div>"; } }
    public static string Td { get { return "<td>"; } }
}
Then, in your client code:
public string Strip(string text, string tag)
{
    /* string parse/replace occurrences of tag, e.g. HTMLTags.Font */
}
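The parse/replace step could be sketched with a regex roughly as follows. This is only a rough sketch, not a robust HTML parser: it uses the enum from the question rather than the string properties above, the class name TagStripper is made up for illustration, and it assumes attribute values never contain a ">" character. It removes the opening and closing tags of one element type and keeps whatever was between them.

```csharp
using System;
using System.Text.RegularExpressions;

// Mirrors the enum from the question (only the three tags listed there).
public enum HTMLTags { Font, Div, Td }

// Hypothetical helper class, named for illustration only.
public static class TagStripper
{
    // Removes the opening and closing tags of the given element type,
    // leaving the content between them untouched.
    public static string Strip(string text, HTMLTags tag)
    {
        string name = Regex.Escape(tag.ToString());

        // </?        "<" optionally followed by "/" (opening or closing tag)
        // name       the tag name, matched case-insensitively
        // (\s[^>]*)? optional whitespace plus attributes, up to the next ">"
        // /?>        optional self-closing slash, then ">"
        string pattern = @"</?" + name + @"(\s[^>]*)?/?>";

        return Regex.Replace(text, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

For example, TagStripper.Strip("<a href=\"x\"><font size=\"-2\">Wall Street Journal</font></a>", HTMLTags.Font) returns "<a href=\"x\">Wall Street Journal</a>". Requiring whitespace or ">" right after the name keeps the pattern from also matching longer tag names that merely start with the same letters.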
Upvotes: 1
Reputation: 24556
The best tool for this should be Html Agility Pack.
It is not regex-based, but its parser is very tolerant of "real world" malformed HTML.
Upvotes: 2
Reputation: 116178
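Use HtmlAgilityPack to parse the HTML:
// Requires "using System.Linq;" for ToArray().
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Copy to an array first so nodes can be removed while iterating.
foreach (var font in doc.DocumentNode.Descendants("font").ToArray())
{
    // Note: Remove() drops the <font> node together with its children.
    font.Remove();
}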
Upvotes: 3