Reputation: 1817
I've got a Java regex with (?:) non-capturing groups, and I can't understand why it reports "null" matches for the non-capturing groups.
If I shorten the regex below to "@te(st)(?:aa)?" with the same ?: non-capturing group, it gives what I would consider expected behavior, matching only 1 group and the full match.
See the regex below:
package com.company;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String regex = "@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$";
        final String string = " /**\n * @test TestGroup\n */\n";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
Result:
Full match: @test TestGroup
Group 1: TestGroup
Group 2: null
Group 3: null
Result of "@te(st)(?:aa)?" with same code:
Full match: @test
Group 1: st
What is it about the first regex that matches the non-capturing groups as null?
Upvotes: 2
Views: 1774
Reputation: 34224
This is the regex pattern in the question:
"@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$"
This regex pattern has three capturing groups:
([:.\\w\\\\x7f-\\xff]+)
(\\S*)
(\\S*)
So your first example is not matching the non-capturing groups as null. Instead, as expected, it is reporting the last two capturing groups as null, because the optional non-capturing groups that contain them did not take part in the match.
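To make the distinction concrete, here is a minimal sketch using a shortened pattern of the same shape. The class name GroupCountDemo and the helper groups() are made up for illustration. Matcher.groupCount() counts only capturing groups, so a (?:...) wrapper never gets a group number of its own; what comes back as null is the capturing group nested inside an optional wrapper that was skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Shortened pattern with the same shape as the one in the question:
    // one plain capturing group, then an optional non-capturing wrapper
    // that contains a second capturing group.
    static final Pattern PATTERN =
            Pattern.compile("@test\\s+(\\w+)(?:[\\t ]+(\\S*))?");

    // Hypothetical helper: returns groups 1..groupCount() of the first match,
    // or an empty list if the input does not match at all.
    static List<String> groups(String input) {
        Matcher m = PATTERN.matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i)); // null if group i did not participate
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Only the two capturing groups are counted, not the (?:...) wrapper.
        System.out.println(PATTERN.matcher("").groupCount());
        System.out.println(groups("@test TestGroup"));
        System.out.println(groups("@test TestGroup extra"));
    }
}
```

This should print 2, then [TestGroup, null], then [TestGroup, extra]: the group count stays the same for both inputs, and only the value of the nested capturing group changes.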
If we change the example string to something that can match all three capturing groups in the pattern, all three groups are populated. For example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String regex = "@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$";
        final String string = "foo @test : bar baz\n";
        // final String string = " /**\n * @test TestGroup\n */\n";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
The output of the above code is:
Full match: @test : bar baz
Group 1: :
Group 2: bar
Group 3: baz
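A related point is that a null group is not the same as a group that matched the empty string. Since (\S*) can match zero characters, a trailing space is enough for the second capturing group to participate with an empty value. The sketch below is only an illustration: the class name NullVersusEmpty and the helper describe() are hypothetical, and the character class is simplified to \w. It relies on Matcher.start(int), which returns -1 for a group that did not take part in the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NullVersusEmpty {
    // Same shape as the question's pattern, with the first character class
    // simplified to \w (an assumption made to keep the example short).
    static final Pattern PATTERN = Pattern.compile(
            "@test\\s+(\\w+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$");

    // Hypothetical helper: describes how group n behaved in the first match.
    static String describe(String input, int n) {
        Matcher m = PATTERN.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        if (m.start(n) == -1) {
            // The group never took part in the match; group(n) would be null.
            return "did not participate";
        }
        return "matched \"" + m.group(n) + "\"";
    }

    public static void main(String[] args) {
        String input = "@test Foo "; // note the single trailing space
        System.out.println("Group 2 " + describe(input, 2));
        System.out.println("Group 3 " + describe(input, 3));
    }
}
```

With the trailing space, group 2 should be reported as matched "" while group 3 did not participate, so code that consumes these groups may need to handle both null and the empty string.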
A few more examples in other languages follow to show that this behaviour is more or less the same across implementations.
import re

regex = re.compile('@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$', re.MULTILINE)
s1 = ' /**\n * @test TestGroup\n */\n'
s2 = 'foo @test : bar baz'

match = re.search(regex, s1)
for i in range(regex.groups + 1):
    print('Group {}: {}'.format(i, match.group(i)))
print()

match = re.search(regex, s2)
for i in range(regex.groups + 1):
    print('Group {}: {}'.format(i, match.group(i)))
The output is:
Group 0: @test TestGroup
Group 1: TestGroup
Group 2: None
Group 3: None
Group 0: @test : bar baz
Group 1: :
Group 2: bar
Group 3: baz
The second match shows that capturing groups nested inside non-capturing groups do capture. In the first match, Python reports the groups that did not participate as None, just as Java reports them as null.
var regex = new RegExp('@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$', 'm')
var s1 = ' /**\n * @test TestGroup\n */\n'
var s2 = 'foo @test : bar baz'
var i
var result = regex.exec(s1)
for (i = 0; i < result.length; i++) {
    console.log('result[' + i + '] :', result[i])
}
console.log()
result = regex.exec(s2)
for (i = 0; i < result.length; i++) {
    console.log('result[' + i + '] :', result[i])
}
The output is:
result[0] : @test TestGroup
result[1] : TestGroup
result[2] : undefined
result[3] : undefined
result[0] : @test : bar baz
result[1] : :
result[2] : bar
result[3] : baz
<?php
$regex = "/@test\\s+([:.\\w\\\\x7f-\\xff]+)(?:[\\t ]+(\\S*))?(?:[\\t ]+(\\S*))?\\s*$/m";
$s1 = " /**\n * @test TestGroup\n */\n";
$s2 = "foo @test : bar baz";
preg_match($regex, $s1, $matches);
for ($i = 0; $i < count($matches); $i++) {
    echo "Match $i: $matches[$i]\n";
}
echo "\n";
preg_match($regex, $s2, $matches);
for ($i = 0; $i < count($matches); $i++) {
    echo "Match $i: $matches[$i]\n";
}
?>
The output is:
Match 0: @test TestGroup
Match 1: TestGroup
Match 0: @test : bar baz
Match 1: :
Match 2: bar
Match 3: baz
PHP differs slightly here: by default preg_match() leaves trailing groups that did not participate out of $matches, which is why Match 2 and Match 3 are absent from the first output.
Upvotes: 4