illegal-immigrant

Reputation: 8244

Regex.Split() strange behaviour

I tried the following regex to split data in a text file, but found a strange bug during testing: a pretty simple file was split clearly incorrectly. Sample code to illustrate the behavior:

        const string line = "511525,3122,9,39,2007,9,39,3127,9,39,\" -49,368.11 \",\"-32,724.16\",2,1,\" 2,347.91 \", -   ,\" 2,234.17 \", -   ,2.2,1.143,2,1.24,FALSE,1,2,0,311,511625";
        const string pattern = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

        Console.WriteLine();
        Console.WriteLine("SPLIT");
        var splitted = Regex.Split(line, pattern, RegexOptions.Compiled);
        foreach (var s in splitted)
        {
            Console.WriteLine(s);
        }

        Console.WriteLine();
        Console.WriteLine("REPLACE");
        var replaced = Regex.Replace(line, pattern, "!" , RegexOptions.Compiled);
        Console.WriteLine(replaced);

        Console.WriteLine();
        Console.WriteLine("MATCH");
        var matches = Regex.Matches(line, pattern);
        foreach (Match match in matches)
        {
            Console.WriteLine(match.Index);
        }

So, as you can see, Split is the only method that produces unexpected results (it splits at invalid positions!). Both Matches and Replace give absolutely correct results. I even tested the regex in RegexBuddy, and it showed the same matches as Regex.Matches. Am I missing something, or does this look like a bug in the Split method?

Console output:

SPLIT
511525
, -   ," 2,234.17 "
3122
, -   ," 2,234.17 "
9
, -   ," 2,234.17 "
39
, -   ," 2,234.17 "
2007
, -   ," 2,234.17 "
9
, -   ," 2,234.17 "
39
, -   ," 2,234.17 "
3127
, -   ," 2,234.17 "
9
, -   ," 2,234.17 "
39
, -   ," 2,234.17 "
" -49,368.11 "
, -   ," 2,234.17 "
"-32,724.16"
, -   ," 2,234.17 "
2
, -   ," 2,234.17 "
1
, -   ," 2,234.17 "
" 2,347.91 "
 -   ," 2,234.17 "
 -
" 2,234.17 "
" 2,234.17 "
 -
2.2
1.143
2
1.24
FALSE
1
2
0
311
511625

REPLACE
511525!3122!9!39!2007!9!39!3127!9!39!" -49,368.11 "!"-32,724.16"!2!1!" 2,347.91 "! -   !" 2,234.17 "! -   !2.2!1.143!2!1.24!FALSE!1!2!0!311!511625

MATCH
6
11
13
16
21
23
26
31
33
36
51
64
66
68
81
87
100
106
110
116
118
123
129
131
133
135
139

Upvotes: 4

Views: 268

Answers (2)

Mark Peters

Reputation: 17775

Based on your response from Microsoft (add ExplicitCapture), it seems the problem is the capturing group. The ExplicitCapture option turns that capturing group into a non-capturing group.

You can do the same without the option by making the group explicitly non-capturing:

const string pattern = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

which, testing with LINQPad, seems to produce the results you are looking for.

Whether there are any capturing groups makes a difference, as described in the docs for Regex.Split:

If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, splitting the string " plum-pear" on a hyphen placed within capturing parentheses adds a string element that contains the hyphen to the returned array.
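The documented behaviour can be reproduced with a minimal sketch like the one below. The class name and the "|" separator used for printing are only illustrative; the input is the "plum-pear" example from the quoted docs, without the leading space.

```csharp
using System;
using System.Text.RegularExpressions;

class SplitCaptureDemo
{
    static void Main()
    {
        // Capturing group: the captured hyphen is inserted into the result array.
        string[] withCapture = Regex.Split("plum-pear", "(-)");
        Console.WriteLine(string.Join("|", withCapture));      // plum|-|pear

        // Non-capturing group: only the text between delimiters is returned.
        string[] withoutCapture = Regex.Split("plum-pear", "(?:-)");
        Console.WriteLine(string.Join("|", withoutCapture));   // plum|pear

        // ExplicitCapture: unnamed parentheses no longer capture, same result.
        string[] explicitOnly = Regex.Split("plum-pear", "(-)", RegexOptions.ExplicitCapture);
        Console.WriteLine(string.Join("|", explicitOnly));     // plum|pear
    }
}
```

The same mechanism appears to explain the SPLIT output in the question: the extra lines would be the text captured by the group inside the lookahead, not additional split positions.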

Upvotes: 2

illegal-immigrant

Reputation: 8244

Solution from MS:

Add the ExplicitCapture regex option (RegexOptions.ExplicitCapture).
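As a rough sketch of that fix, the snippet below applies the option to the pattern from the question. The shortened sample line and the class name are assumptions made for illustration, not the original data.

```csharp
using System;
using System.Text.RegularExpressions;

class ExplicitCaptureDemo
{
    static void Main()
    {
        // Hypothetical shortened line: one quoted field that contains a comma.
        const string line = "1,\" 2,347.91 \",3";
        // Pattern from the question, capturing group left unchanged.
        const string pattern = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

        // Default options: captured text is mixed into the result array.
        string[] noisy = Regex.Split(line, pattern);
        Console.WriteLine("Default options: " + noisy.Length + " elements");

        // ExplicitCapture: the unnamed group stops capturing, so only fields remain.
        string[] fields = Regex.Split(line, pattern, RegexOptions.ExplicitCapture);
        foreach (string field in fields)
        {
            Console.WriteLine("[" + field + "]");
        }
        // With ExplicitCapture this prints:
        // [1]
        // [" 2,347.91 "]
        // [3]
    }
}
```

The option is passed only to Split here because Replace and Matches already behaved as expected in the question; combining it with RegexOptions.Compiled using the | operator should work the same way.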

Upvotes: 2
