Timsen

Reputation: 4126

Regex Strip Span tags completely

I want to strip the span tags from an HTML string.

I have an HTML string:

<a href=\"http://www.dr.dk/roskilde\"><span>Roskilde</span><span>Festival</span></a>

I need to strip it down to: Roskilde Festival.

At the moment I have a regex that should match all span tags, but it's failing:

    System.Collections.Specialized.StringCollection sc = new System.Collections.Specialized.StringCollection();

    sc.Add(@"/<\s*\/?\s*span\s*.*?>/g");

    foreach (string s in sc)
    {
        k = System.Text.RegularExpressions.Regex.Replace(pContent, s, "", System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    }
    k = System.Text.RegularExpressions.Regex.Replace(pContent, @"&nbsp;", @"&#160;");

Any ideas?

P.S. I don't want to use Html Agility Pack.

Upvotes: 0

Views: 2144

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77505

Regular expressions are not the best way to process HTML. Use an HTML parser instead, because regular expressions cannot keep track of HTML nesting.

Consider using negated character classes, e.g. <whatever[^>]*>, which cannot run past the closing > of the tag.
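To illustrate the difference, here is a minimal sketch comparing a greedy .* with a negated character class. The class name NegatedClassDemo, its two helper methods, and the sample string are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Greedy ".*" runs to the LAST '>' on the line, swallowing other tags and text.
    public static string StripGreedy(string html)
    {
        return Regex.Replace(html, "<span.*>", "");
    }

    // "[^>]*" stops at the first '>', so only the opening span tag itself is matched.
    public static string StripNegated(string html)
    {
        return Regex.Replace(html, "<span[^>]*>", "");
    }

    public static void Main()
    {
        string html = "<span class=\"a\">Roskilde</span> <b>Festival</b>";
        Console.WriteLine(StripGreedy(html));   // prints an empty line: everything was matched
        Console.WriteLine(StripNegated(html));  // prints: Roskilde</span> <b>Festival</b>
    }
}
```

The negated class only handles the opening tag here; closing tags still need their own alternative in the pattern.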

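I guess you copied this from somewhere, but your regexp is not proper C# syntax: the leading / and the trailing /g are the delimiters and global flag of a JavaScript- or Perl-style regex literal, and .NET treats them as literal characters that must appear in the input. Reread a tutorial on regular expressions in C#! Try this string: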

Example /<span>/g does this tag get removed?
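A small sketch of that experiment, assuming nothing beyond System.Text.RegularExpressions. The class name LiteralSlashDemo and the helper Strip are hypothetical; the pattern is the one from the question, unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralSlashDemo
{
    // The pattern from the question, unchanged. In .NET the leading "/" and the
    // trailing "/g" are ordinary characters that the input must contain.
    public const string OriginalPattern = @"/<\s*\/?\s*span\s*.*?>/g";

    public static string Strip(string input)
    {
        return Regex.Replace(input, OriginalPattern, "", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // The tag is removed here only because "/" precedes it and "/g" follows it.
        Console.WriteLine(Strip("Example /<span>/g does this tag get removed?"));
        // prints: Example  does this tag get removed?

        // Ordinary HTML has no such slashes around the tag, so nothing matches.
        Console.WriteLine(Strip("<span>Roskilde</span>"));
        // prints: <span>Roskilde</span>
    }
}
```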

What you probably meant to use was:

sc.Add(@"</?span( [^>]*|/)?>");
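Put together, the corrected code might look like the sketch below. The class SpanStripper and the method Clean are hypothetical names; pContent and k follow the question. One assumption on my part: each replacement should work on the result of the previous one, since the question's code passes pContent to every Replace call, which means the final &nbsp; replacement overwrites k and discards the span stripping.

```csharp
using System;
using System.Collections.Specialized;
using System.Text.RegularExpressions;

public static class SpanStripper
{
    public static string Clean(string pContent)
    {
        var sc = new StringCollection();
        // Matches <span>, </span>, <span attr="...">, and <span/>, but not e.g. <spanner>.
        sc.Add(@"</?span( [^>]*|/)?>");

        // Start from the input and keep feeding each result into the next replacement.
        string k = pContent;
        foreach (string s in sc)
        {
            k = Regex.Replace(k, s, "", RegexOptions.IgnoreCase);
        }

        // Apply the entity replacement to k, not to the untouched pContent.
        return Regex.Replace(k, "&nbsp;", "&#160;");
    }

    public static void Main()
    {
        string html = "<a href=\"http://www.dr.dk/roskilde\"><span>Roskilde</span><span>Festival</span></a>";
        Console.WriteLine(Clean(html));
        // prints: <a href="http://www.dr.dk/roskilde">RoskildeFestival</a>
    }
}
```

Note that this only removes the span tags themselves: the surrounding a element stays, and no space is inserted between the two words, so getting exactly "Roskilde Festival" would take further steps.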

Upvotes: 3
