Replace matching elements including nested ones

I need to replace all occurrences of a span with id="comment_n", where n can be any number. Any such span can contain nested ones, and each span can have different attributes. Example:

foo <span id="comment_1">text <span id="comment_2" attr="value">text.</span></span> bar

I have this regular expression:

<span id="comment_\d+.+?<\/span>

But it doesn't include the last closing span tag.

I need to do a replace:

Regex.Replace(input, regex, string.Empty, RegexOptions.Multiline | RegexOptions.IgnoreCase);

Demo: http://regexr.com/3bpkf

Upvotes: 0

Views: 69

Answers (2)

Wiktor Stribiżew

Reputation: 627119

I suggest using HtmlAgilityPack to obtain what you need. You can specify an XPath that selects only the <span> tags whose id attribute starts with comment_ (case-insensitively) and then remove them. The additional check for the number after comment_ can be done with or without a regex. Here is a way to remove tags whose given attribute value is checked with a regex:

public string HtmlAgilityPackRemoveTagsWithSpecificAttribute(string html, string xpath, string attribute_name, Regex rx)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) &&
        (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL (http or https)
        var doc = new HtmlAgilityPack.HtmlWeb();
        hap = doc.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    var nodes = hap.DocumentNode.SelectNodes(xpath);
    if (nodes != null)
    {
       foreach (var node in nodes)
       {
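           // GetAttributeValue falls back to "" if the XPath matched a node without this attribute
           if (rx.IsMatch(node.GetAttributeValue(attribute_name, string.Empty)))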
               node.ParentNode.RemoveChild(node);
       }
    }
    return hap.DocumentNode.OuterHtml;
}

You can use it like this:

var res = HtmlAgilityPackRemoveTagsWithSpecificAttribute(html,
  "//span[starts-with(translate(@id, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
           'abcdefghijklmnopqrstuvwxyz'), 'comment_')]", "id", 
                new Regex("(?i)^comment_[0-9]+$"));

Note that translate is used to enable case-insensitive comparison (comment_, COMMENT_, etc.). If you do not need that, just use "//span[starts-with(@id, 'comment_')]".

If you use the regex more than once, instantiate it once before passing it to the method. Alternatively, call the static Regex.IsMatch inside the method and change the method signature accordingly.
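As a rough sketch of the static Regex.IsMatch variant mentioned above, the id check could look like the following. The class and method names (CommentIdCheck, IsCommentId) are hypothetical and only for illustration; the pattern is the same one passed to the method earlier, with RegexOptions.IgnoreCase standing in for the inline (?i).

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentIdCheck
{
    // Hypothetical helper: true when the id looks like comment_<digits>,
    // compared case-insensitively. A null id is treated as no match.
    public static bool IsCommentId(string id) =>
        Regex.IsMatch(id ?? string.Empty, "^comment_[0-9]+$", RegexOptions.IgnoreCase);
}
```

With a helper like this, the method would no longer need the Regex parameter and could test the attribute value directly.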

Upvotes: 2

Saeb Amini

Reputation: 24429

The match doesn't include the last closing span tag because of the ? in your regex pattern: it makes the .+ quantifier "lazy", so it matches the shortest satisfying string. If you remove the ?, the quantifier becomes greedy and the match will include the last closing span tag:

<span id="comment_\d+.+<\/span>

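Note, however, that the greedy version runs to the last </span> on the line, so it can also remove unrelated markup that follows the comment. I'd suggest using HtmlAgilityPack for parsing your DOM and manipulating it instead.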
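To illustrate the lazy versus greedy behaviour described above, here is a small sketch that runs both patterns over the sample input from the question. The class and method names (LazyVsGreedyDemo, RemoveLazy, RemoveGreedy) and the second test string are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsGreedyDemo
{
    // Pattern from the question: the lazy .+? stops at the first </span>.
    private const string LazyPattern = @"<span id=""comment_\d+.+?</span>";
    // Pattern from this answer: the greedy .+ runs to the last </span> on the line.
    private const string GreedyPattern = @"<span id=""comment_\d+.+</span>";

    public static string RemoveLazy(string input) =>
        Regex.Replace(input, LazyPattern, string.Empty, RegexOptions.IgnoreCase);

    public static string RemoveGreedy(string input) =>
        Regex.Replace(input, GreedyPattern, string.Empty, RegexOptions.IgnoreCase);

    public static void Main()
    {
        // Sample input from the question (nested comment spans).
        const string nested =
            "foo <span id=\"comment_1\">text <span id=\"comment_2\" attr=\"value\">text.</span></span> bar";
        Console.WriteLine(RemoveLazy(nested));   // foo </span> bar
        Console.WriteLine(RemoveGreedy(nested)); // foo  bar

        // Assumed extra input: an unrelated span follows the comment span.
        const string trailing =
            "a <span id=\"comment_1\">x</span> keep <span id=\"other\">y</span> z";
        Console.WriteLine(RemoveGreedy(trailing)); // a  z
    }
}
```

The last call shows why the greedy pattern is fragile: it removes " keep " and the unrelated span as well, which is one more reason to prefer an HTML parser here.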

Upvotes: -1
