Eric Herlitz

Reputation: 26297

Problems with named capturing groups in C# regex

I've been struggling with this for a while:

var matches = Regex.Matches("<h2>hello world</h2>",
    @"<(?<tag>[^\s/>]+)(?<innerHtml>.*)(?<closeTag>[^\s>]+)>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);

string tag = matches[0].Groups["tag"].Value; // "h2"
string innerHtml = matches[0].Groups["innerHtml"].Value; // ">hello world</h"
string closeTag = matches[0].Groups["closeTag"].Value; // "2"

As can be seen, tag works as expected, while innerHtml and closeTag do not. Any advice? Thanks.

Update

The input string may vary; here is another scenario: "<div class='myclass'><h2>hello world</h2></div>"

Upvotes: 3

Views: 98

Answers (2)

Alan Moore

Reputation: 75242

You want the Singleline option, not Multiline. Singleline enables . to match linefeeds, while Multiline changes the behavior of the anchors (^ and $), which you aren't using.
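The difference between the two options only shows up when the input contains a line break. Here is a minimal sketch, assuming a two-line variant of the question's input; the class name DotOptionsDemo and the simplified pattern are made up for illustration and are not from the question:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotOptionsDemo
{
    // Hypothetical sample input: the inner text spans two lines.
    public const string Input = "<h2>hello\nworld</h2>";

    // Simplified pattern, for illustration only.
    public const string Pattern = @"<h2>(?<innerHtml>.*)</h2>";

    public static bool Matches(RegexOptions options) =>
        Regex.IsMatch(Input, Pattern, options);

    public static void Main()
    {
        // Multiline only changes ^ and $, so '.' still stops at '\n'.
        Console.WriteLine(Matches(RegexOptions.Multiline));  // False

        // Singleline lets '.' match '\n' as well.
        Console.WriteLine(Matches(RegexOptions.Singleline)); // True
    }
}
```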

Also, if you want the closing tag to have the same name as the opening tag, you should use a backreference. Here I've used '' as the name delimiters instead of <> to reduce confusion:

var matches = Regex.Matches("<h2>hello world</h2>",
    @"<(?'tag'[^\s/>]+)[^>]*>(?'innerHtml'.*)</\k'tag'>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

And you don't need the Compiled option. All it does is make it more expensive to create the Regex object, for an increase in performance that you almost certainly don't need and won't notice.
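To see what the backreference buys you, here is a small sketch run against the second input from the question. It assumes a variant of the pattern that also consumes any attributes and the `>` that ends the opening tag, so neither leaks into innerHtml; the class and method names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // \k'tag' must repeat whatever the 'tag' group captured.
    // [^>]*> skips optional attributes and the end of the opening tag
    // (an assumption added for this sketch).
    private static readonly Regex Element = new Regex(
        @"<(?'tag'[^\s/>]+)[^>]*>(?'innerHtml'.*)</\k'tag'>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static Match Find(string html) => Element.Match(html);

    public static void Main()
    {
        Match ok = Find("<div class='myclass'><h2>hello world</h2></div>");
        Console.WriteLine(ok.Groups["tag"].Value);       // div
        Console.WriteLine(ok.Groups["innerHtml"].Value); // <h2>hello world</h2>

        // Mismatched closing tag: the backreference rejects it.
        Match bad = Find("<h2>hello world</h3>");
        Console.WriteLine(bad.Success);                  // False
    }
}
```

Because `.*` is greedy, this picks the last closing tag with the same name, which is what you want for the outer div here but may not be for repeated sibling elements.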

Upvotes: 0

p.s.w.g

Reputation: 149020

Try matching the > and </ outside of the capture groups, like this:

var matches = Regex.Matches("<h2>hello world</h2>",
    @"<(?<tag>[^\s/>]+)>(?<innerHtml>.*)</(?<closeTag>[^\s>]+)>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);

Update: a more specific example that should be a little more flexible:

var matches = Regex.Matches(
    "<div class='myclass'><h2>hello world</h2></div>",
    @"<(?<tag>[^\s>]+)               #Opening tag
        \s*(?<attributes>[^>]*)\s*>  #Attributes inside tag (optional)
      (?<innerHtml>.*)               #Inner Html
      </(?<closeTag>\1)>             #Closing tag, must match opening tag",
    RegexOptions.IgnoreCase | 
    RegexOptions.Compiled | 
    RegexOptions.Multiline |
    RegexOptions.IgnorePatternWhitespace);

string tag = matches[0].Groups["tag"].Value;             // "div"
string attr = matches[0].Groups["attributes"].Value;     // "class='myclass'"
string innerHtml = matches[0].Groups["innerHtml"].Value; // "<h2>hello world</h2>"
string closeTag = matches[0].Groups["closeTag"].Value;   // "div"

Upvotes: 1
