yelsayed

Reputation: 5542

C# Regular expression returns group multiple times

I have a very simple regex like this in C#:

(var \= 0\;)

But when I try to match this against a string that has only one occurrence of the pattern, I get multiple groups returned. The input string is:

foo bar
var = 0;
foo

I get 1 match returned by the Regex object, but inside it I see two groups, each with 1 capture, which is the string I want. I need the grouping parentheses in the regex because this is part of a bigger regex, and I need this to be captured as a group. What am I doing wrong?

EDIT

This is the C# code I'm using:

private const string REGEX = "(var \\= [0]\\;)";
MatchCollection matches = Regex.Matches(inputStr, REGEX);
foreach (Match m in matches)
{
    foreach (Group g in m.Groups)
    {
        Console.WriteLine("group[" + g.Captures.Count + "]: '" + g.ToString() + "'");
    }
}

This is what I get:

group[1]: 'var = 0;'
group[1]: 'var = 0;'

My question is, why do I get two groups and not one?

EDIT #2:

A more complicated input shows the problem. The input:

# preceding comment
class
{
   (param1 = "val1", param2 = "val2", param3 = val3)
}
[
    # inside comment
    setting1 = 0;
    setting2 = 0;
]

The regex I'm using (it's probably not the most readable, but you can paste it into a regex viewer if you want to check it out):

(\#[^\n]*)?(?:[\s\r\n]*)domain(?:[\s\r\n]*)\{(?:[\s\r\n]*)\((?:[\s\r\n]*)(((?:[\s\r\n]*)(accountName(?:[\s\r\n]*)\=(?:[\s\r\n]*)\"[^"]+\"[,]?)(?:[\s\r\n]*))|((?:[\s\r\n]*)(tableName(?:[\s\r\n]*)\=(?:[\s\r\n]*)\"[^"]+\"[,]?)(?:[\s\r\n]*))|((?:[\s\r\n]*)(cap(?:[\s\r\n]*)\=(?:[\s\r\n]*)[\d]+[,]?)(?:[\s\r\n]*))|((?:[\s\r\n]*)(MinPartitionCount(?:[\s\r\n]*)\=(?:[\s\r\n]*)[\d]+[,]?)(?:[\s\r\n]*)))+\)(?:[\s\r\n]*)\}(?:[\s\r\n]*)\[(?:[\s\r\n]*)(\#[^\n]*)?(?:[\s\r\n]*)((?:[\s\r\n]*)(IsSplitEnabled(?:[\s\r\n]*)\=(?:[\s\r\n]*)[0|1](?:[\s\r\n]*)\;)(?:[\s\r\n]*)|(?:[\s\r\n]*)(IsMergeEnabled(?:[\s\r\n]*)\=(?:[\s\r\n]*)[0|1](?:[\s\r\n]*)\;)(?:[\s\r\n]*))*(?:[\s\r\n]*)\]

And I'm getting:

group:1: '# preceding comment
domain
{
   (param1 = "val1", param2 = "val2", param3 = val3)
}
[
    # inside comment
    setting1 = 0;
    setting2 = 0;
]'
group:1: '# preceding comment'
group:3: 'cap = 1200'
group:1: 'param1 = "val1", '
group:1: 'param1 = "val1",'
group:1: 'param2 = "val2", '
group:1: 'param2 = "val2",'
group:1: 'param3 = val3'
group:1: 'param3 = val3'
group:1: '# inside comment'
group:2: 'setting1 = 0;
'
group:1: 'setting1 = 0;'
group:1: 'setting2 = 0;'

Upvotes: 1

Views: 605

Answers (1)

davisoa

Reputation: 5439

According to the documentation, the first element of the GroupCollection is the entire match, not the first group created by ().

From near the bottom of the Remarks section of the Match.Groups documentation:

If the regular expression engine can find a match, the first element of the GroupCollection object returned by the Groups property contains a string that matches the entire regular expression pattern. Each subsequent element represents a captured group, if the regular expression includes capturing groups.

Because your whole pattern is wrapped in a single pair of parentheses, items 0 and 1 are identical: item 0 is the entire match and item 1 is the one capturing group, which spans the same text. To see only the actual group matches, skip the first element of the GroupCollection and process only the groups you have defined in the regex.
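As a minimal sketch of skipping element 0, the loop can start at index 1. The helper name ExplicitGroups and the class GroupDemo are made up for illustration, and the input string is the one from the question; adapt it to however you hold your own pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Hypothetical helper (not part of the .NET API): collects only the
    // explicitly defined capturing groups of every match.
    public static List<string> ExplicitGroups(string input, string pattern)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            // Groups[0] is always the entire match, so start at index 1.
            for (int i = 1; i < m.Groups.Count; i++)
            {
                // Groups that did not participate in the match are skipped.
                if (m.Groups[i].Success)
                {
                    result.Add(m.Groups[i].Value);
                }
            }
        }
        return result;
    }

    static void Main()
    {
        string input = "foo bar\nvar = 0;\nfoo";
        foreach (string g in ExplicitGroups(input, @"(var = [0];)"))
        {
            Console.WriteLine("group: '{0}'", g); // prints: group: 'var = 0;'
        }
    }
}
```

With the question's input this prints a single line, because the one match has exactly one group after element 0 is skipped.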

EDIT

After investigating the additional data, I think I may have found the cause of your duplicates.

I believe that you are seeing more than one Match, and so the outer foreach loop runs twice, not once. This is because there are 2 separate lines with "= 0;" in the example.

Here is LinqPad example code that shows 2 matches being found, and therefore duplicate groups being output. (Note: I tested with a shortened version of the simple regex you provided, since the long regex didn't produce any matches against this input.)

static string inputStr = "# preceding comment \r\n" + 
"class\r\n" + 
"{\r\n" + 
"   (param1 = \"val1\", param2 = \"val2\", param3 = val3)\r\n" + 
"}\r\n" + 
"[\r\n" + 
"    # inside comment\r\n" + 
"    setting1 = 0;\r\n" + 
"    setting2 = 0;\r\n" + 
"]\r\n";

const string REGEX = "(\\= [0]\\;)";

void Main()
{

    var regex = new System.Text.RegularExpressions.Regex(REGEX);
    MatchCollection matches = regex.Matches(inputStr);
    Console.WriteLine("Matches:{0}", matches.Count);
    int matchCnt = 0;
    foreach (Match m in matches)
    {
        int groupCnt = 0;
        foreach (Group g in m.Groups)
        {
            Console.WriteLine("match[{0}] group[{1}]: Captures:{2} '{3}'", matchCnt, groupCnt, g.Captures.Count, g);
            //g.Dump();
            groupCnt++;
        }
        matchCnt++;
    }
    Console.WriteLine("Done!");
}

And here is the output generated by LinqPad when this code runs:

Matches:2
match[0] group[0]: Captures:1 '= 0;'
match[0] group[1]: Captures:1 '= 0;'
match[1] group[0]: Captures:1 '= 0;'
match[1] group[1]: Captures:1 '= 0;'
Done!

Upvotes: 2
