CSAT

Reputation: 207

Remove style from HTML Tags using Regex C#

I want to remove the style attribute from HTML tags using C#. The result should contain only plain HTML tags.

For example, if String = <p style="margin: 15px 0px; padding: 0px; border: 0px; outline: 0px;">Hello</p>, then it should return String = <p>Hello</p>

The same should apply to all HTML tags: <strong></strong>, <b></b>, and so on.

Please help me with this.

Upvotes: 5

Views: 11221

Answers (5)

Eyad

Reputation: 201

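All the answers are fine, but it can also be done simply with this method: "Your HTML String".Replace("style", "data-tags"); You can also replace "class" the same way. Note that this renames the attribute rather than removing it, and it also changes every other occurrence of the word "style" in the string.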

Upvotes: 0

Ashish Srivastava

Reputation: 16

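   // Remove <style> and <script> blocks together with their contents.
   source = Regex.Replace(source, "(<style.+?</style>)|(<script.+?</script>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
   // Remove images, Office (<o:...>) elements and HTML comments.
   source = Regex.Replace(source, "(<img.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
   source = Regex.Replace(source, "(<o:.+?</o:.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
   source = Regex.Replace(source, "<!--.+?-->", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
   // Remove everything from class= up to the end of the tag.
   source = Regex.Replace(source, "class=.+?>", ">", RegexOptions.IgnoreCase | RegexOptions.Singleline);
   // Turn newlines into <br/>, then replace every tag that is not in the allow-list with a space.
   // A negative lookahead is needed here: a character class such as [^(a|img|...)] only excludes single characters.
   // "br" is added to the list (an assumption) so the <br/> tags inserted here survive.
   source = Regex.Replace(source.Replace(System.Environment.NewLine, "<br/>"), @"<(?!/?(?:a|img|b|i|u|ul|ol|li|br)\b)[^>]*>", " ", RegexOptions.IgnoreCase);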

Upvotes: -1

ZooZ

Reputation: 971

I usually use the code below to remove inline style blocks, class attributes, images and comments from an Outlook message before saving it to the database:

    desc = Regex.Replace(desc, "(<style.+?</style>)|(<script.+?</script>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, "(<img.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, "(<o:.+?</o:.+?>)", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, "<!--.+?-->", "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, "class=.+?>", ">", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    desc = Regex.Replace(desc, @"class=.+?\s", " ", RegexOptions.IgnoreCase | RegexOptions.Singleline); // verbatim string: "\s" is not a valid escape in a regular C# string

Upvotes: 0

Lucas Trzesniewski

Reputation: 51330

First, as others suggest, an approach using a proper HTML parser is much better. Either use HtmlAgilityPack or CsQuery.
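As a rough sketch of the parser route, assuming the HtmlAgilityPack NuGet package is installed: the class name `StyleStripper` and the helper `RemoveAttributes` are hypothetical names made up for this illustration, not part of the library.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class StyleStripper
{
    // Hypothetical helper: removes the named attributes from every element.
    public static string RemoveAttributes(string html, params string[] attributeNames)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Walk every node below the document root and touch only elements.
        foreach (var node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType != HtmlNodeType.Element)
                continue;

            foreach (var name in attributeNames)
                node.Attributes.Remove(name);
        }

        // Serialize the modified tree back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `StyleStripper.RemoveAttributes(html, "style", "class")` would strip both attributes in one pass. Because the parser understands quoting and nesting, it should cope with input that trips up a regex.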

If you really want a regex solution, here it is:

Replace this pattern: (<.+?)\s+style\s*=\s*(["']).*?\2(.*?>)
With: $1$3

Demo: http://regex101.com/r/qJ1vM1/1


To remove multiple attributes, since you're using .NET, this should work:

Replace (?<=<[^<>]+)\s+(?:style|class)\s*=\s*(["']).*?\1
With an empty string
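In C#, the two replacements above might be wired up roughly as follows. This is only a sketch: the class and method names are hypothetical, and `RegexOptions.IgnoreCase` is an assumption added here so that `STYLE=` is matched too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStyleStripper
{
    // First pattern: captures the text before and after the style attribute.
    private static readonly Regex StyleOnly = new Regex(
        @"(<.+?)\s+style\s*=\s*([""']).*?\2(.*?>)",
        RegexOptions.IgnoreCase);

    // Second pattern: relies on .NET's variable-length lookbehind.
    private static readonly Regex StyleOrClass = new Regex(
        @"(?<=<[^<>]+)\s+(?:style|class)\s*=\s*([""']).*?\1",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: keeps groups 1 and 3, dropping the attribute.
    public static string RemoveStyle(string html)
    {
        return StyleOnly.Replace(html, "$1$3");
    }

    // Hypothetical helper: deletes each matched attribute.
    public static string RemoveStyleAndClass(string html)
    {
        return StyleOrClass.Replace(html, string.Empty);
    }
}
```

For the input from the question, `RegexStyleStripper.RemoveStyle(input)` yields `<p>Hello</p>`. Unquoted attribute values are not matched by either pattern, which is one more reason to prefer a parser.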

Upvotes: 10

Noctis

Reputation: 11763

As others said, you can use the HTML Agility Pack, which comes with a nice test tool that shows you what you're doing.

Other than that, the options are a regex, which is usually not recommended for HTML, or simply looping over all the characters of your input. When you hit a <, keep reading until the first whitespace, then drop every character up to the closing >. This removes all attributes, not only style. It should take care of most basic cases, but you'll have to test it.

Here's a little snippet that will do it:

using System;
using System.Text;

class Program
{
    static void Main()
    {
        // your input
        string input = @"<p style=""margin: 15px 0px; padding: 0px; border: 0px; outline: 0px;"">Hello</p>";
        // temp variables
        StringBuilder sb = new StringBuilder();
        bool inside = false; // true while between '<' and '>'
        bool delete = false; // true while skipping a tag's attributes
        // analyze string
        foreach (char c in input)
        {
            // special case, start bracket
            if (c == '<')
            {
                inside = true;
                delete = false;
            }
            // special case, close bracket
            else if (c == '>')
            {
                inside = false;
                delete = false;
            }
            // other characters inside a tag
            else if (inside)
            {
                // once you hit whitespace, ignore the rest until the closing bracket
                if (char.IsWhiteSpace(c))
                    delete = true;
            }
            // add if needed
            if (!delete)
                sb.Append(c);
        }
        var result = sb.ToString(); // -> holds: "<p>Hello</p>"
        Console.WriteLine(result);
    }
}

Upvotes: 0
