Reputation: 5500
Does anyone have a regex that can remove the attributes from a body tag?
For example:
<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">
to return:
<body>
It would also be interesting to see an example of removing just a specific attribute, like:
<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">
to return:
<body bgcolor="White">
Upvotes: 0
Views: 5294
Reputation: 217293
You can't parse XHTML with regex. Have a look at the HTML Agility Pack instead.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
if (body != null)
{
    body.Attributes.Remove("style");
}
Upvotes: 3
Reputation: 5500
Here's the chunky code I've got working at the moment; I'll be looking at reducing it:
private static string SimpleHtmlCleanup(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    //foreach(HtmlNode nodebody in doc.DocumentNode.SelectNodes("//a[@href]"))
    var bodyNodes = doc.DocumentNode.SelectNodes("//body");
    if (bodyNodes != null)
    {
        foreach (HtmlNode nodeBody in bodyNodes)
        {
            nodeBody.Attributes.Remove("style");
        }
    }
    var scriptNodes = doc.DocumentNode.SelectNodes("//script");
    if (scriptNodes != null)
    {
        foreach (HtmlNode nodeScript in scriptNodes)
        {
            nodeScript.Remove();
        }
    }
    var linkNodes = doc.DocumentNode.SelectNodes("//link");
    if (linkNodes != null)
    {
        foreach (HtmlNode nodeLink in linkNodes)
        {
            nodeLink.Remove();
        }
    }
    var xmlNodes = doc.DocumentNode.SelectNodes("//xml");
    if (xmlNodes != null)
    {
        foreach (HtmlNode nodeXml in xmlNodes)
        {
            nodeXml.Remove();
        }
    }
    var styleNodes = doc.DocumentNode.SelectNodes("//style");
    if (styleNodes != null)
    {
        foreach (HtmlNode nodeStyle in styleNodes)
        {
            nodeStyle.Remove();
        }
    }
    var metaNodes = doc.DocumentNode.SelectNodes("//meta");
    if (metaNodes != null)
    {
        foreach (HtmlNode nodeMeta in metaNodes)
        {
            nodeMeta.Remove();
        }
    }
    var result = doc.DocumentNode.OuterHtml;
    return result;
}
Upvotes: 0
Reputation: 10605
string pattern = @"<body[^>]*>";
string test = @"<body bgcolor=""White"" style=""font-family:sans-serif;font-size:10pt;"">";
string result = Regex.Replace(test,pattern,"<body>",RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);
string pattern2 = @"(?<=<body[^>]*)\s*style=""[^""]*""(?=[^>]*>)";
result = Regex.Replace(test, pattern2, "", RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);
This is just in case your project requirements limit your third-party options (and don't give you the time to reinvent a parser).
Upvotes: 0
Reputation: 282885
Three ways to do it with regexes...
string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
string a1 = Regex.Replace(html, @"(?<=<body\b).*?(?=>)", "");
string a2 = Regex.Replace(html, @"<(body)\b.*?>", "<$1>");
string a3 = Regex.Replace(html, @"<(body)(\s[^>]*)?>", "<$1>");
Console.WriteLine(a1);
Console.WriteLine(a2);
Console.WriteLine(a3);
Upvotes: 2
Reputation: 282885
Here's how you'd do it in SharpQuery:
string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
var sq = SharpQuery.Load(html);
var body = sq.Find("body").Single();
foreach (var a in body.Attributes.ToArray())
    a.Remove();
StringWriter sw = new StringWriter();
body.OwnerDocument.Save(sw);
Console.WriteLine(sw.ToString());
SharpQuery depends on HtmlAgilityPack and is a beta product, but I wanted to prove that you could do it this way.
Upvotes: 0
Reputation: 34592
LittleBobbyTables's comment above is correct!
Regex is not the right tool here. If you read the answer at the link LittleBobbyTables posted, you'll see the undue strain and stress its author went through as a result of using the wrong tool for the job.
Regex is NOT duct tape for this kind of thing, nor is it the answer to everything (including 42). Use the right tool for the right job.
Instead, check out HtmlAgilityPack, which will do the job for you and ultimately save you the stress, tears and blood of getting to grips with parsing HTML using regex.
Upvotes: 0
Reputation: 9172
If you're doing a quick-and-dirty shell script, and you don't plan on using this much...
s/<body [^>]*>/<body>/
That said, I have to agree with everyone else that a parser is a better idea. I understand that sometimes you must make do with limited resources, but if you rely on a regex here, it has a strong chance of coming back to bite you when you least expect it.
To remove a specific attribute:
s/\(<body [^>]*\) style="[^>"]*"/\1/
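That will grab "body" and any attributes up to "style", drop the "style" attribute, and spit out the rest. Note that, as written, it only matches when at least one other attribute comes before "style", because the pattern expects a space both right after "body" and right before "style".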
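For anyone who wants to try these, here is a minimal sketch of both substitutions wrapped in shell functions and run against the sample tag from the question. The function names `strip_body_attrs` and `strip_body_style` are made up for illustration. The second function uses a slightly adjusted pattern (no space required right after "body", which is my own tweak rather than part of the command above) so that "style" is also removed when it is the first attribute. It still assumes double-quoted attribute values and no ">" inside them.

```shell
#!/bin/sh
# Sample input taken from the question.
html='<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">'

# Hypothetical helper: drop every attribute from the opening body tag.
# Reads HTML on stdin, writes the result to stdout.
strip_body_attrs() {
    sed 's/<body [^>]*>/<body>/'
}

# Hypothetical helper: drop only the style attribute.
# \(<body[^>]*\) captures "<body" plus any attributes before style;
# the style attribute itself is matched and left out of the replacement.
strip_body_style() {
    sed 's/\(<body[^>]*\) style="[^"]*"/\1/'
}

printf '%s\n' "$html" | strip_body_attrs
printf '%s\n' "$html" | strip_body_style
```

Both are still line-oriented and case-sensitive, so a tag split across lines or written as `<BODY>` will slip through, which is one more reason to prefer a parser for anything beyond a throwaway script.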
Upvotes: 2