Replace matching elements including nested ones

Question

I need to replace all occurrences of a span with id="comment_n", where n can be any number. Any such span can contain nested spans, and each span can have different attributes. Example:

foo text text. bar

I have this regular expression:

http://regexr.com/3bpkf

Wiktor Stribiżew · Accepted Answer

I suggest using HtmlAgilityPack for this. You can write an XPath expression that selects only the tags whose id attribute starts with comment_ (case-insensitively) and then remove them. The additional check for the number after comment_ can be done with a regex, or skipped if the prefix match is enough. Here is a method that removes tags whose attribute value matches a given regex.

public string HtmlAgilityPackRemoveTagsWithSpecificAttribute(string html, string xpath, string attribute_name, Regex rx)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) &&
        (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL (http or https): download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    var nodes = hap.DocumentNode.SelectNodes(xpath);
    if (nodes != null)
    {
       foreach (var node in nodes)
       {
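           // Guard against nodes selected by an XPath that does not require the attribute
           var attr = node.Attributes[attribute_name];
           if (attr != null && rx.IsMatch(attr.Value))
               node.ParentNode.RemoveChild(node); // nested matches are removed with their ancestor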
       }
    }
    return hap.DocumentNode.OuterHtml;
}

You can use it like this:

var res = HtmlAgilityPackRemoveTagsWithSpecificAttribute(html,
  "//span[starts-with(translate(@id, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
           'abcdefghijklmnopqrstuvwxyz'), 'comment_')]", "id", 
                new Regex("(?i)^comment_[0-9]+$"));

Note that translate is used to enable case-insensitive comparison (comment_, COMMENT_, etc.). If you do not need that, just use "//span[starts-with(@id, 'comment_')]".

If you use the regex more than once, instantiate it once and pass the same instance to the method. Alternatively, call the static Regex.IsMatch inside the method and change the method signature to accept a pattern string instead of a Regex.
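As a rough sketch of the static Regex.IsMatch variant, the method could take the pattern as a string. The class name CommentSpanRemover, the method name RemoveByAttributePattern, and the sample markup in Main are assumptions made for illustration (the question's original example markup is not available here). This version handles only HTML strings, not URLs.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper class; the names are illustrative, not from the original answer.
public static class CommentSpanRemover
{
    public static string RemoveByAttributePattern(string html, string xpath,
                                                  string attributeName, string pattern)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches the XPath
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                // Falls back to "" when the attribute is missing
                var value = node.GetAttributeValue(attributeName, "");
                if (Regex.IsMatch(value, pattern, RegexOptions.IgnoreCase))
                    node.ParentNode.RemoveChild(node); // nested spans go with their ancestor
            }
        }
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Assumed sample input: an outer comment span containing a nested one
        var html = "<p>foo <span id=\"comment_1\" class=\"a\">text " +
                   "<span id=\"comment_2\">text</span>.</span> bar</p>";
        var result = RemoveByAttributePattern(
            html, "//span[starts-with(@id, 'comment_')]", "id", "^comment_[0-9]+$");
        Console.WriteLine(result);
    }
}
```

Because RegexOptions.IgnoreCase is passed to Regex.IsMatch, the (?i) prefix is not needed in the pattern. The XPath here is still case-sensitive; use the translate form if mixed-case ids must be matched too.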
