bcm

Reputation: 5500

Regex to remove body tag attributes (C#)

Does anyone have a regex that can remove the attributes from a body tag?

For example:

<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">

to return:

<body>

It would also be interesting to see an example of removing just a specific attribute, like:

<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">

to return:

<body bgcolor="White">

Upvotes: 0

Views: 5294

Answers (7)

dtb

Reputation: 217293

You can't parse XHTML with regex. Have a look at the HTML Agility Pack instead.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
if (body != null)
{
    body.Attributes.Remove("style");
}
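
For the first case in the question (stripping every attribute, not just `style`), the same approach might look like the sketch below. It assumes the HtmlAgilityPack package is referenced; the class and method names are hypothetical, chosen only for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class BodyAttributeStripper
{
    // Hypothetical helper: returns the document with all attributes
    // removed from the first <body> element, if there is one.
    public static string StripAllBodyAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body != null)
        {
            // Clears the whole attribute collection in one call.
            body.Attributes.RemoveAll();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Swapping `RemoveAll()` for `Remove("style")` gives the single-attribute variant shown above.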

Upvotes: 3

bcm

Reputation: 5500

Here is the chunky code I have working at the moment; I'll be looking at reducing it:

private static string SimpleHtmlCleanup(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //foreach(HtmlNode nodebody in doc.DocumentNode.SelectNodes("//a[@href]"))

    var bodyNodes = doc.DocumentNode.SelectNodes("//body");
    if (bodyNodes != null)
    {
        foreach (HtmlNode nodeBody in bodyNodes)
        {
            nodeBody.Attributes.Remove("style");
        }
    }

    var scriptNodes = doc.DocumentNode.SelectNodes("//script");
    if (scriptNodes != null)
    {
        foreach (HtmlNode nodeScript in scriptNodes)
        {
            nodeScript.Remove();
        }
    }

    var linkNodes = doc.DocumentNode.SelectNodes("//link");
    if (linkNodes != null)
    {
        foreach (HtmlNode nodeLink in linkNodes)
        {
            nodeLink.Remove();
        }
    }

    var xmlNodes = doc.DocumentNode.SelectNodes("//xml");
    if (xmlNodes != null)
    {
        foreach (HtmlNode nodeXml in xmlNodes)
        {
            nodeXml.Remove();
        }
    }

    var styleNodes = doc.DocumentNode.SelectNodes("//style");
    if (styleNodes != null)
    {
        foreach (HtmlNode nodeStyle in styleNodes)
        {
            nodeStyle.Remove();
        }
    }

    var metaNodes = doc.DocumentNode.SelectNodes("//meta");
    if (metaNodes != null)
    {
        foreach (HtmlNode nodeMeta in metaNodes)
        {
            nodeMeta.Remove();
        }
    }

    var result = doc.DocumentNode.OuterHtml;

    return result;
}
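
One possible way to reduce the repetition is to loop over the tag names, since every block except the body one does the same select-then-remove. This is only a sketch under the same HtmlAgilityPack assumption; the `HtmlCleanup` class and the `TagsToRemove` array are hypothetical names, not part of the library.

```csharp
using System;
using HtmlAgilityPack;

public static class HtmlCleanup
{
    // Hypothetical list: elements that are dropped from the document entirely.
    private static readonly string[] TagsToRemove =
        { "script", "link", "xml", "style", "meta" };

    public static string SimpleHtmlCleanup(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Strip only the style attribute from <body>; other attributes stay.
        var bodyNodes = doc.DocumentNode.SelectNodes("//body");
        if (bodyNodes != null)
        {
            foreach (HtmlNode body in bodyNodes)
            {
                body.Attributes.Remove("style");
            }
        }

        foreach (string tag in TagsToRemove)
        {
            // SelectNodes returns null (not an empty list) when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes("//" + tag);
            if (nodes == null)
            {
                continue;
            }

            foreach (HtmlNode node in nodes)
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The behaviour should match the long version: the same elements are removed in the same order, just driven by data instead of copy-pasted blocks.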

Upvotes: 0

Les

Reputation: 10605

string pattern = @"<body[^>]*>";
string test = @"<body bgcolor=""White"" style=""font-family:sans-serif;font-size:10pt;"">";
string result = Regex.Replace(test,pattern,"<body>",RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);
string pattern2 = @"(?<=<body[^>]*)\s*style=""[^""]*""(?=[^>]*>)";
result = Regex.Replace(test, pattern2, "", RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);

This is just in case your project requirements limit your third-party options (and don't give you the time to reinvent a parser).

Upvotes: 0

mpen

Reputation: 282885

Three ways to do it with regexes...

string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
string a1 = Regex.Replace(html, @"(?<=<body\b).*?(?=>)", "");
string a2 = Regex.Replace(html, @"<(body)\b.*?>", "<$1>");
string a3 = Regex.Replace(html, @"<(body)(\s[^>]*)?>", "<$1>");
Console.WriteLine(a1);
Console.WriteLine(a2);
Console.WriteLine(a3);

Upvotes: 2

mpen

Reputation: 282885

Here's how you'd do it in SharpQuery:

string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
var sq = SharpQuery.Load(html);
var body = sq.Find("body").Single();
foreach (var a in body.Attributes.ToArray())
    a.Remove();
StringWriter sw = new StringWriter();
body.OwnerDocument.Save(sw);
Console.WriteLine(sw.ToString());

SharpQuery depends on HtmlAgilityPack and is a beta product, but I wanted to prove that you could do it this way.

Upvotes: 0

t0mm13b

Reputation: 34592

LittleBobbyTables' comment above is correct!

Regex is not the right tool for this. The answer behind the link that LittleBobbyTables posted shows the strain and stress the answerer went through as a result of using the wrong tool for the job.

Regex is NOT duct tape for this kind of problem, nor is it the answer to everything (that would be 42). Use the right tool for the right job.

Check out HtmlAgilityPack instead: it will do the job for you and ultimately save you the stress, tears and blood of wrestling with regex to parse HTML.

Upvotes: 0

Tim

Reputation: 9172

If you're doing a quick-and-dirty shell script, and you don't plan on using this much...

s/<body [^>]*>/<body>/

but I'm going to have to agree with everyone else that a parser is a better idea. I understand that sometimes you must make do with limited resources, but if you rely on a regex here... it has a strong chance of coming back to bite you when you least expect it.

To remove a specific attribute:

s/\(<body[^>]*\) style="[^>"]*"/\1/

That will grab "body" and any attributes up to "style", drop the "style" attribute, and spit out the rest.
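
Since the question asks for C#, here is a rough sketch of the same idea in that language, done in two steps: first isolate the opening body tag, then delete one attribute inside it. The `BodyTagRegex` class and `RemoveAttribute` method are hypothetical names. It carries the usual regex caveats: a `>` inside an attribute value, or text inside a quoted value that happens to look like the attribute, will confuse it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyTagRegex
{
    // Matches only the opening <body ...> tag, case-insensitively.
    private static readonly Regex BodyTag =
        new Regex(@"<body\b[^>]*>", RegexOptions.IgnoreCase);

    // Hypothetical helper: removes one named attribute from the body tag.
    public static string RemoveAttribute(string html, string name)
    {
        // Leading whitespace, the attribute name, then a double-quoted,
        // single-quoted, or unquoted value.
        var attribute = new Regex(
            @"\s+" + Regex.Escape(name) + @"\s*=\s*(""[^""]*""|'[^']*'|[^\s>]+)",
            RegexOptions.IgnoreCase);

        // The attribute pattern is applied only to the text of the matched
        // tag, so the rest of the document is left untouched.
        return BodyTag.Replace(html, m => attribute.Replace(m.Value, ""));
    }
}
```

Called as `BodyTagRegex.RemoveAttribute(html, "style")` on the tag from the question, this should leave `<body bgcolor="White">`, and it does not depend on where in the tag the attribute appears.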

Upvotes: 2
