Reputation: 26297
I've been struggling with this for a while:
var matches = Regex.Matches("<h2>hello world</h2>",
@"<(?<tag>[^\s/>]+)(?<innerHtml>.*)(?<closeTag>[^\s>]+)>",
RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);
string tag = matches[0].Groups["tag"].Value; // "h2"
string innerHtml = matches[0].Groups["innerHtml"].Value; // ">hello world</h"
string closeTag = matches[0].Groups["closeTag"].Value; // "2"
As can be seen, `tag` works as expected, while `innerHtml` and `closeTag` do not. Any advice? Thanks.
Update
The input string may vary; here is another scenario:
"<div class='myclass'><h2>hello world</h2></div>"
Upvotes: 3
Views: 98
Reputation: 75242
You want the `Singleline` option, not `Multiline`. `Singleline` enables `.` to match linefeeds, while `Multiline` changes the behavior of the anchors (`^` and `$`), which you aren't using.
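To make the difference between the two options concrete, here is a minimal sketch. The class `DotOptionsDemo`, the helper `FindH2`, and the two-line input are made up for illustration and are not part of the question; the pattern is deliberately simplified to a fixed `<h2>` tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotOptionsDemo
{
    // Hypothetical helper: runs a simplified <h2>...</h2> pattern with the given options.
    public static Match FindH2(string input, RegexOptions options)
    {
        return Regex.Match(input, @"<h2>(?<innerHtml>.*)</h2>", options);
    }

    public static void Main()
    {
        // Assumed input: the element content spans two lines.
        string input = "<h2>hello\nworld</h2>";

        // Multiline only changes ^ and $, so '.' still stops at '\n' and nothing matches.
        Console.WriteLine(FindH2(input, RegexOptions.Multiline).Success);

        // Singleline lets '.' match '\n', so the whole element is matched.
        Match m = FindH2(input, RegexOptions.Singleline);
        Console.WriteLine(m.Success);
        Console.WriteLine(m.Groups["innerHtml"].Value);
    }
}
```

With input that contains no line breaks, such as the strings in the question, both options behave the same for this pattern; the difference only shows up once the content spans several lines.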
Also, if you want the closing tag to have the same name as the opening tag, you should use a backreference. Here I've used `''` as the name delimiters instead of `<>` to reduce confusion:
var matches = Regex.Matches("<h2>hello world</h2>",
@"<(?'tag'[^/>]+)>(?'innerHtml'.*)</\k'tag'>",
RegexOptions.IgnoreCase | RegexOptions.Singleline);
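As a rough illustration of what the backreference buys you, the sketch below checks one well-formed and one mismatched element. The class `BackreferenceDemo` and the helper `MatchElement` are hypothetical names, and the pattern here consumes the `>` that ends the opening tag outside the `innerHtml` group and assumes a tag without attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // \k'tag' must match the same text that the 'tag' group captured.
    // Assumption for this sketch: the opening tag carries no attributes.
    private const string Pattern = @"<(?'tag'[^\s/>]+)>(?'innerHtml'.*)</\k'tag'>";

    // Hypothetical helper wrapping Regex.Match with the options from the answer.
    public static Match MatchElement(string input)
    {
        return Regex.Match(input, Pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        Match ok = MatchElement("<h2>hello world</h2>");
        Console.WriteLine(ok.Groups["tag"].Value);        // h2
        Console.WriteLine(ok.Groups["innerHtml"].Value);  // hello world

        // The closing tag name differs from the opening one, so the backreference fails.
        Match bad = MatchElement("<h2>hello world</h3>");
        Console.WriteLine(bad.Success);                   // False
    }
}
```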
And you don't need the `Compiled` option. All it does is make it more expensive to create the Regex object, for an increase in performance that you almost certainly don't need and won't notice.
Upvotes: 0
Reputation: 149020
Try matching the `>` and `</` outside of the capture groups, like this:
var matches = Regex.Matches("<h2>hello world</h2>",
@"<(?<tag>[^\s/>]+)>(?<innerHtml>.*)</(?<closeTag>[^\s>]+)>",
RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);
Update: a more specific example that should be a little more flexible:
var matches = Regex.Matches(
"<div class='myclass'><h2>hello world</h2></div>",
@"<(?<tag>[^\s>]+) #Opening tag
\s*(?<attributes>[^>]*)\s*> #Attributes inside tag (optional)
(?<innerHtml>.*) #Inner Html
</(?<closeTag>\1)> #Closing tag, must match opening tag",
RegexOptions.IgnoreCase |
RegexOptions.Compiled |
RegexOptions.Multiline |
RegexOptions.IgnorePatternWhitespace);
string tag = matches[0].Groups["tag"].Value; // "div"
string attr = matches[0].Groups["attributes"].Value; // "class='myclass'"
string innerHtml = matches[0].Groups["innerHtml"].Value; // "<h2>hello world</h2>"
string closeTag = matches[0].Groups["closeTag"].Value; // "div"
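One caveat about the numbered backreference `\1`: in .NET, named groups are numbered after any unnamed groups, so `\1` points at `tag` only as long as the pattern contains no unnamed capturing group. The sketch below uses two simplified patterns and the class name `GroupNumberingDemo`, all of which are illustrative assumptions, to show how a named backreference such as `\k<tag>` avoids that dependency.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupNumberingDemo
{
    // The unnamed group (.*) becomes group 1, so \1 no longer refers to 'tag'.
    public const string Numbered = @"<(?<tag>\w+)>(.*)</\1>";

    // \k<tag> always refers to the 'tag' group, whatever its number is.
    public const string Named = @"<(?<tag>\w+)>(.*)</\k<tag>>";

    public static void Main()
    {
        string input = "<h2>hello</h2>";

        // \1 would have to repeat the inner text "hello" here, so this does not match.
        Console.WriteLine(Regex.IsMatch(input, Numbered)); // False

        // The named backreference repeats "h2", so this matches.
        Console.WriteLine(Regex.IsMatch(input, Named));    // True
    }
}
```

In the pattern above every group is named, so `tag` is group 1 and `\1` behaves as intended; the named form simply keeps working if an unnamed group is added later.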
Upvotes: 1