Reputation: 4768
I need to replace all occurrences of span elements having id="comment_n", where n can be any number. Any such span can contain nested ones, and each span can have different attributes. Example:
foo <span id="comment_1">text <span id="comment_2" attr="value">text.</span></span> bar
I have this regular expression:
<span id="comment_\d+.+?<\/span>
But the match doesn't include the last closing span tag.
I need to do a replace:
Regex.Replace(input, regex, string.Empty, RegexOptions.Multiline | RegexOptions.IgnoreCase);
Demo: http://regexr.com/3bpkf
Upvotes: 0
Views: 69
Reputation: 627119
I suggest using HtmlAgilityPack to obtain what you need. You can specify an XPath that selects only the <span> tags whose id attribute starts with comment_ (case-insensitively) and then remove them. The additional check for the number after comment_ can be done with a regex, or without. Here is a way to remove tags that have a specific attribute whose value is checked with a regex.
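// Requires: using System; using System.Text.RegularExpressions; plus the HtmlAgilityPack package
public string HtmlAgilityPackRemoveTagsWithSpecificAttribute(string html, string xpath, string attribute_name, Regex rx)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) &&
        (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    {   // html is a URL: download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    {   // html is a string: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    var nodes = hap.DocumentNode.SelectNodes(xpath); // null when nothing matches
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            // Skip nodes that lack the attribute instead of throwing
            var attribute = node.Attributes[attribute_name];
            if (attribute != null && rx.IsMatch(attribute.Value))
                node.ParentNode.RemoveChild(node); // removes the node together with its nested children
        }
    }
    return hap.DocumentNode.OuterHtml;
}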
You can use it like this:
var res = HtmlAgilityPackRemoveTagsWithSpecificAttribute(html,
    "//span[starts-with(translate(@id, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', " +
    "'abcdefghijklmnopqrstuvwxyz'), 'comment_')]", "id",
    new Regex("(?i)^comment_[0-9]+$"));
Note that translate is used to enable case-insensitive comparison (comment_, COMMENT_, etc.). If you do not need that, just use "//span[starts-with(@id, 'comment_')]".
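The regex can be instantiated once before being passed to the method if you use it more than once. Alternatively, call the static Regex.IsMatch inside the method and change the method signature accordingly (for example, to take the pattern as a string).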
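To illustrate why the regex check is worth keeping on top of starts-with, here is a minimal sketch that uses only System.Text.RegularExpressions. The class and method names (CommentIdCheck, IsCommentId) are made up for this illustration and are not part of HtmlAgilityPack; the pattern is the one passed to the method above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentIdCheck
{
    // Same pattern as in the usage above: "comment_" followed only by digits,
    // compared case-insensitively. Created once and reused.
    private static readonly Regex CommentId = new Regex("(?i)^comment_[0-9]+$");

    // Hypothetical helper: true only when the whole id is comment_<digits>.
    public static bool IsCommentId(string id)
    {
        return id != null && CommentId.IsMatch(id);
    }
}
```

With this check, ids such as comment_1 and COMMENT_22 qualify, while comment_, comment_1a and comment_box do not, even though all of them would pass a plain starts-with(@id, 'comment_') test.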
Upvotes: 2
Reputation: 24429
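As to why it doesn't include the last closing span tag: it's because of the ? in your regex pattern. It makes the .+ quantifier "lazy", so it matches the shortest satisfying string and stops at the first </span>. If you remove it, the quantifier becomes greedy and the match extends to the last </span> tag on the line. That covers your example, but it also swallows any unrelated markup sitting between the first matching span and the last closing tag: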
<span id="comment_\d+.+<\/span>
But I'd suggest using HtmlAgilityPack for parsing your DOM and manipulating it.
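The difference between the lazy and the greedy pattern can be checked with a small sketch. The class and method names (SpanRegexDemo, RemoveLazy, RemoveGreedy) are invented for this illustration; the patterns are the two quoted in this thread, and only RegexOptions.IgnoreCase is passed because Multiline has no effect on patterns without ^ or $.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanRegexDemo
{
    private const RegexOptions Options = RegexOptions.IgnoreCase;

    // Lazy .+? stops at the first </span> it can reach.
    public static string RemoveLazy(string input)
    {
        return Regex.Replace(input, @"<span id=""comment_\d+.+?<\/span>", string.Empty, Options);
    }

    // Greedy .+ runs to the last </span> on the same line.
    public static string RemoveGreedy(string input)
    {
        return Regex.Replace(input, @"<span id=""comment_\d+.+<\/span>", string.Empty, Options);
    }
}
```

On the sample from the question, RemoveLazy leaves a dangling closing tag ("foo </span> bar"), while RemoveGreedy returns "foo  bar". On an input such as x <span id="comment_1">a</span> keep <span class="x">b</span> end, however, RemoveGreedy returns "x  end" and drops the unrelated span, which is why a parser is the safer choice.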
Upvotes: -1