Gene S

Reputation: 2793

Parsing quote marks out of html style attribute

I need to find a way to remove the single quote marks surrounding the font names in the font-family style.

So this...

<span style="font-family: 'Verdana'; color: #0000ff; font-size: 10pt;"><span style="font-family: 'Arial';"><strong>2.0: Scope</strong></span></span>

would change to this...

<span style="font-family: Verdana; color: #0000ff; font-size: 10pt;"><span style="font-family: Arial;"><strong>2.0: Scope</strong></span></span>

I only care about style attributes whose values are wrapped in double quotes. If the value is wrapped in single quotes, then I know the font names inside it will not be single-quoted.

I have to do this within C# because the application processing this HTML is running as a Windows Service.

I know that it is normally a no-no to use regular expressions to parse HTML, but I was hoping this could be an exception since I am looking for a very specific case. I do have access to an HTML parser, but it is very slow compared to a regular expression.

Here is the best I could come up with...

var html = "<span style=\"font-family: 'Verdana'; color: #0000ff; font-size: 10pt;\"><span style=\"font-family: 'Arial';\"><strong>2.0: Scope</strong></span></span>";
var newHtml = Regex.Replace(html, "style(.*)=(.*)\"(.*)font-family:(.*?)[\">]", m => m.Value.Replace("'", ""));

This produces the correct output, but not through the correct matches. It finds a single long match which, after the quotes are stripped, looks like this...

style="font-family: Verdana; color: #0000ff; font-size: 10pt;"><span style="font-family: Arial;"

What I want is to find two matches like this...

style="font-family: 'Verdana'; color: #0000ff; font-size: 10pt;"
style="font-family: 'Arial';"

Being a novice with regular expressions, I cannot seem to find the right combination.

Or, to be more specific: I need a way to find a value within a font-family that is surrounded by single quotes, and then remove the single quotes from that value.

Can someone help me come up with the appropriate regular expression?
Is there an alternative to regular expression that would work better in this scenario?

Upvotes: 1

Views: 984

Answers (3)

Gene S

Reputation: 2793

Here's how I resolved it...

var html = "<span style=\"font-family: 'Verdana'; color: #0000ff; font-size: 10pt;\"><span style=\"font-family: 'Arial';\"><strong>2.0: Scope</strong></span></span>"; 
var newHtml = Regex.Replace(html, "style\\s*=\\s*\"[^\"]*\\bfont-family:.*?'.*?(;|\")", m => m.Value.Replace("'", ""));

Thanks Lou for guiding me in the right direction.
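
A variation on the same idea, as a minimal sketch rather than a definitive implementation: match each double-quoted style attribute first, then strip single quotes only inside its font-family declaration, so quotes in other properties are left alone. The class name FontFamilyQuotes, the method name Strip, and both patterns are assumptions made for illustration. It also assumes the attribute value contains no escaped double quotes.

```csharp
using System.Text.RegularExpressions;

public static class FontFamilyQuotes
{
    // One double-quoted style attribute; [^"]* cannot run into the next attribute.
    private static readonly Regex StyleAttribute =
        new Regex("style\\s*=\\s*\"[^\"]*\"");

    // One font-family declaration, up to the next ';' or the end of the value.
    private static readonly Regex FontFamily =
        new Regex("font-family\\s*:[^;\"]*");

    // Removes single quotes from font-family declarations inside
    // double-quoted style attributes; everything else is returned unchanged.
    public static string Strip(string html)
    {
        return StyleAttribute.Replace(html, style =>
            FontFamily.Replace(style.Value, decl => decl.Value.Replace("'", "")));
    }
}
```

Calling FontFamilyQuotes.Strip(html) on the sample string above returns the desired markup with Verdana and Arial unquoted. Style attributes wrapped in single quotes never match the outer pattern, so they pass through untouched.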

Upvotes: 1

krillgar

Reputation: 12815

Pass each string that you get into this function:

// Requires "using System.Text;" for StringBuilder.
private static string RemoveSingleQuote(string psHTML) {
    // If the style value itself is wrapped in single quotes, return the string unchanged.
    if (psHTML.StartsWith("<span style='")) return psHTML;

    // Otherwise copy every character except single quotes.
    StringBuilder sb = new StringBuilder();

    foreach (char c in psHTML) {
        if (c != '\'') {
            sb.Append(c);
        }
    }

    return (sb.ToString());
}

Upvotes: 0

Lou Franco

Reputation: 89222

This is happening because regular expression matching is greedy -- it will try to match the longest string that matches.

var newHtml = Regex.Replace(html, "style(.*)=(.*)\"(.*)font-family:(.*?)[\">]", m => m.Value.Replace("'", ""));

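Your problem is the greedy (.*) groups before font-family: each one grabs as much as it can, so the match runs from the first style attribute into the second. A simple fix is to use negated character classes that cannot leave the attribute

var newHtml = Regex.Replace(html, "style\\s*=\\s*\"[^\">]*font-family:[^\">]*[\">]", m => m.Value.Replace("'", ""));

The ^ in [^\">] means not these characters.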

Of course, these are all hacks -- there is definitely real HTML where this won't work.
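
To see the effect of greediness directly, here is a small sketch that lists the matches each pattern produces. The class StyleMatchDemo, its FindAll helper, and the Bounded pattern are assumptions made for illustration. Greedy is the pattern from the question.

```csharp
using System.Text.RegularExpressions;

public static class StyleMatchDemo
{
    // The pattern from the question: its greedy (.*) groups may span several attributes.
    public const string Greedy = "style(.*)=(.*)\"(.*)font-family:(.*?)[\">]";

    // Assumed alternative: [^"]* cannot cross a double quote, so a match ends
    // at the closing quote of the attribute it started in.
    public const string Bounded = "style\\s*=\\s*\"[^\"]*\"";

    // Returns the text of every match so the two patterns can be compared.
    public static string[] FindAll(string html, string pattern)
    {
        var matches = Regex.Matches(html, pattern);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }
}
```

With the sample HTML from the question, FindAll(html, StyleMatchDemo.Greedy) returns one long match that runs from the first style attribute to the end of the second, while FindAll(html, StyleMatchDemo.Bounded) returns the two attributes separately.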

Upvotes: 1
