user1122069

Reputation: 1817

Java Non-Capturing Regex Group gives "null" captures

I've got a Java regex with (?:...) non-capturing groups, and I can't understand why it reports "null" matches for the non-capturing groups.

If I shorten the regex below to "@te(st)(?:aa)?" with the same ?: non-capturing group, it gives what I would consider expected behavior, matching only 1 group and the full match.

See the regex below:

package com.company;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {
        final String regex = "@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$";
        final String string = "    /**\n     * @test     TestGroup\n     */\n";

        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

Result:

Full match: @test     TestGroup
Group 1: TestGroup
Group 2: null
Group 3: null

Result of "@te(st)(?:aa)?" with same code:

Full match: @test
Group 1: st

What is it about the first regex that matches the non-capturing groups as null?

Upvotes: 2

Views: 1774

Answers (1)

Susam Pal

Reputation: 34224

Answer

This is the regex pattern in the question:

"@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$"

This regex pattern has three capturing groups:

  1. ([:.\\w\\\\x7f-\\xff]+)
  2. (\\S*)
  3. (\\S*)

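So your first example is not reporting the non-capturing groups as null. Non-capturing groups are never numbered, but the capturing groups nested inside them still are. The input has nothing after "TestGroup" for the two optional (?:...)? groups to match, so they are skipped and capturing groups 2 and 3 inside them do not participate in the match. Matcher.group(int) returns null for such groups, and groupCount() reports the number of capturing groups in the pattern (three here), not the number that matched. Your shortened regex has no capturing group inside (?:aa)?, which is why it shows only group 1.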
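To make the difference between the two regexes in the question concrete, here is a minimal sketch. The class name NestedGroupDemo, the helper describe, and the pattern "@te(st)(?:aa(bb))?" are illustrative assumptions, not part of the question. The pattern is the shortened regex with one capturing group added inside the optional non-capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedGroupDemo {

    // Hypothetical helper: reports groupCount() and every group of the first match.
    public static String describe(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // groupCount() is a property of the pattern, known before any match is attempted.
        StringBuilder out = new StringBuilder("groupCount=" + matcher.groupCount());
        if (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                // group(i) is null when group i did not participate in the match.
                out.append(", ").append(i).append('=').append(matcher.group(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // No capturing group inside (?:aa)?, so only group 1 exists.
        System.out.println(describe("@te(st)(?:aa)?", "@test"));
        // prints: groupCount=1, 0=@test, 1=st

        // A capturing group nested in the skipped optional group is reported as null.
        System.out.println(describe("@te(st)(?:aa(bb))?", "@test"));
        // prints: groupCount=2, 0=@test, 1=st, 2=null

        // When the optional part is present, the nested group captures normally.
        System.out.println(describe("@te(st)(?:aa(bb))?", "@testaabb"));
        // prints: groupCount=2, 0=@testaabb, 1=st, 2=bb
    }
}
```

The second call mirrors what happens with the original regex: the optional non-capturing group is skipped, so the capturing group inside it is counted but has no text.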

Another example that matches all capturing groups

If we change the input string to one that lets all three capturing groups participate in the match, every group captures text. For example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {
        final String regex = "@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$";
        final String string = "foo @test : bar baz\n";
        // final String string = "    /**\n     * @test     TestGroup\n     */\n";

        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);

        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

The output of the above code is:

Full match: @test : bar baz
Group 1: :
Group 2: bar
Group 3: baz

A few more examples in other languages follow to show that this behaviour is more or less the same across implementations.

Python Example

import re

regex = re.compile('@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$', re.MULTILINE)
s1 = '    /**\n     * @test     TestGroup\n     */\n'
s2 = 'foo @test : bar baz'

match = re.search(regex, s1)
for i in range(regex.groups + 1):
    print('Group {}: {}'.format(i, match.group(i)))
print()

match = re.search(regex, s2)
for i in range(regex.groups + 1):
    print('Group {}: {}'.format(i, match.group(i)))

The output is:

Group 0: @test     TestGroup
Group 1: TestGroup
Group 2: None
Group 3: None

Group 0: @test : bar baz
Group 1: :
Group 2: bar
Group 3: baz

The second match shows that capturing groups nested within non-capturing groups do capture text. In the first match, Python reports the groups that did not participate as None, just as Java reports them as null.

JavaScript Example

var regex = new RegExp('@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$', 'm')
var s1 = '    /**\n     * @test     TestGroup\n     */\n'
var s2 = 'foo @test : bar baz';
var i

var result = regex.exec(s1)
for (i = 0; i < result.length; i++) {
    console.log('result[' + i + '] :', result[i])
}
console.log()

result = regex.exec(s2)
for (i = 0; i < result.length; i++) {
    console.log('result[' + i + '] :', result[i])
}

The output is:

result[0] : @test     TestGroup
result[1] : TestGroup
result[2] : undefined
result[3] : undefined

result[0] : @test : bar baz
result[1] : :
result[2] : bar
result[3] : baz

PHP Example

<?php
$regex = "/@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$/m";
$s1 = "    /**\n     * @test     TestGroup\n     */\n";
$s2 = "foo @test : bar baz";

preg_match($regex, $s1, $matches);
for ($i = 0; $i < count($matches); $i++) {
    echo "Match $i: $matches[$i]\n";
}
echo "\n";

preg_match($regex, $s2, $matches);
for ($i = 0; $i < count($matches); $i++) {
    echo "Match $i: $matches[$i]\n";
}
?>

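The output is shown below. By default, PHP's preg_match() leaves trailing groups that did not participate out of $matches, so the first match lists only groups 0 and 1: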

Match 0: @test     TestGroup
Match 1: TestGroup

Match 0: @test : bar baz
Match 1: :
Match 2: bar
Match 3: baz
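
The PHP output above lists only groups 0 and 1 for the first string, while Java printed null entries for groups 2 and 3. If the null entries are unwanted in Java, one option is to skip them when reading the groups. The following is a minimal sketch under stated assumptions: the class name, the helper participatingGroups, and the simplified pattern are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParticipatingGroups {

    // Hypothetical helper: returns "index=text" for each group of the first
    // match that actually participated, or an empty list if nothing matched.
    public static List<String> participatingGroups(String regex, String input) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                String text = matcher.group(i);
                if (text != null) { // null means the group did not participate
                    groups.add(i + "=" + text);
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Simplified stand-in for the question's regex: two optional
        // non-capturing groups, each wrapping one capturing group.
        String regex = "@te(st)(?:aa(bb))?(?:cc(dd))?";

        System.out.println(participatingGroups(regex, "@test"));
        // prints: [0=@test, 1=st]

        System.out.println(participatingGroups(regex, "@testccdd"));
        // prints: [0=@testccdd, 1=st, 3=dd]
    }
}
```

Unlike the PHP output, this sketch keeps the original group numbers, so a skipped group in the middle (group 2 in the second call) does not shift the numbering of later groups.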

Upvotes: 4
