Reputation: 1831

C# - Processing html tag attributes

I'm getting some html data from remote server and before displaying it in the UI of application i need to make some changes, i.e. delete counters, replace links, etc. Removing some tag with contents and changing specific link is not a big deal, but when it comes to some advanced processing, i have some problems.There is a need to replace (delete) few html tag attributes (not a tag itself - there are plenty of examples over internet about this). For example : delete all onmouseover handlers from buttons. I know that XPath would be a perfect fit for such problem, but i don't know it at all and although my information is XHTML-complaint, it's stored in a string variable and not queryable :(. So i'm trying to use Regular Expressions to solve this problem, with no success for now. I guess it's a mistake in pattern...

public string Processing (string Source, string Tag, string Attribute)
{    
return System.Text.RegularExpressions.Regex.Replace(Source, string.Format(@"<{0}(\s+({1}=""([^""]*)""|\w+=(""[^""]*""|\S+)))+>", Tag, Attribute), string.Empty);
}

...

string before = @"<input type=""text"" name=""Input"" id=""Input"" onMouseOver=""some js to be eliminated"">";
string after = Processing(before,"input","onMouseOver");
// expected : <input type="text" name="Input" id="Input">"

Upvotes: 0

Answers (3)

Alan Moore

Reputation: 75222

That's an interesting approach, but like bobince said, you can only process one attribute per match. This regex will match everything up to the attribute you're interested in:

@"(<{0}\b[^>]*?\b){1}=""(?:[^""]*)"""

Then you use "$1" as your replacement string to plug back in everything but the attribute.

This approach requires you to make a separate pass over the string for each of your target tag/attribute pairs, and at the beginning of each pass you have to create and compile the regex. Not very efficient, but if the string isn't too large it should be okay. A much bigger problem is that it won't catch duplicate attributes; if there are two "onmouseover" attributes on a button, you'll only catch the first one.

If I were doing this in C# I would probably use the regex to match the target tag, then use a MatchEvaluator to remove all of the target attributes at once. But seriously, if the string really is well-formed XML, there's no excuse for not using XML-specific tools to process it--this is what XML was invented for.

Upvotes: 1

Jaded

Reputation: 1831

So, the rewritten code is :

public static string Process(string Source, string Tag, string Attribute)
{
        return Regex.Replace(Source, string.Format(@"(<{0}\b[^>]*?\b)({1}=""(?:[^""]*)"")", Tag, Attribute), "$1");                  
}

I've tested it, and it works fine.

string before = @"<input type=""text"" name=""Input"" id=""Input"" onMouseOver=""some js to be eliminated1""/>"
        + "\r\n" + @"<input type=""text"" name=""Input2"" id=""Input2"" onMouseOver=""some js to be eliminated2"">"
        + "\r\n" + @"<input type=""text"" name=""Input3"" id=""Input3"" onMouseOver=""some js to be eliminated3"">";            
string after = Process(before, "input", "onMouseOver");
//<input type="text" name="Input" id="Input" />
//<input type="text" name="Input2" id="Input2" >
//<input type="text" name="Input3" id="Input3" >

For now the problem is solved. I'd try to use a xml-related workaround, but it seems like before creating XmlDocument i need to rework input html again, because according to w3c validator it has errors. It starts as follows

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <HTML xmlns="http://www.w3.org/1999/xhtml">
    <HEAD>
    <TITLE>page title</TITLE>

On LoadXml i get "System.Xml.XmlException about '>' marker is not acceptable - line 1 position 63. Adding document type definition causes the same exception but this time about '--' marker incorrect , '>' expected.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/strict.dtd">

Any ideas ? Or let it go ?)

Upvotes: 0

bobince

Reputation: 536339

I know that XPath would be a perfect fit for such problem

Quite so. Or any other XML parser-based technique, such as DOM methods.

It's really not a hard thing to learn: stuff your string into the XmlDocument.LoadXml() method then call selectNodes() on it with something like '//tagname[@attrname]' to get a list of elements with the unwanted attribute. Peasy.

i'm trying to use Regular Expressions to solve this problem, with no success

What is it with regexes? People keep using them even when they know it's the wrong thing, even though they're frequently unreadable and difficult to get right (as the endless “why doesn't my regex work?” questions demonstrate).

So what's so attractive about the damned things? There are several questions on SO every day about parsing [X][HT]ML with regex, all answered “don't use regex, regex is not powerful enough to parse HTML”. But somehow it never gets through.

I guess it's a mistake in pattern...

Well the pattern appears to be trying to match entire tags to replace with an empty string, which isn't what you want. Instead you'd want to be targeting just the attribute, then to ensure only attributes inside a “<tag ...>” counted, you'd have to use a negative lookbehind assertion — “(?!<tag )”. But you usually can't have a variable-length lookbehind assertion, which you would need to allow other attributes to come between the tag name and the targeted attribute.

Also your ‘\S+’ clause has the potential to gobble up large amounts of unintended content. As you've got well-formed XHTML, you're guaranteed properly quoted attributes, so you don't need that anyway.

But the mistake is not the pattern. It is regex.

Upvotes: 0

C# - Processing html tag attributes

Answers (3)

Related Questions