Reputation: 73
I have a working reference Perl script whose regex was copied from a Java snippet that isn't giving the expected results:
my $regex = '^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$';
if ("A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A" =~ /$regex/)
{
print "Matches 1=$1 2=$2 3=$3 4=$4\n";
}
This correctly outputs:
Matches 1=PROD 2=COMP 3=LOGL 4=00000000-0000-8033-0000-000200354F0A
Now the equivalent Java snippet:
private static final String NON_SYSTEM_TYPE_REGEX = "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$";
private static final Pattern NON_SYSTEM_TYPE_PATTERN = Pattern.compile(MutableUniqueIdentity.NON_SYSTEM_TYPE_REGEX);
...
final Matcher match = MutableUniqueIdentity.NON_SYSTEM_TYPE_PATTERN.matcher(uniqueIdentity);
The uniqueIdentity input is further back in the stack trace (in a unit test) and has this value:
final String id5CompactString = "A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A";
NOTE: The regex and uniqueIdentity values were copied into the Perl program from a debug session to check whether a different language comes up with a different result (which it did).
ADDITIONAL NOTE: The reason the non-capture group is there is to allow the third element in the string to be optional, so it has to deal with both of these:
A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A
A-PROD-COMP-00000000-0000-8033-0000-000200354F0A
My unit test fails in Java: the third match group, which should be LOGL, is in fact 0000.
Here is a screenshot of the debugger right after the regex match line above:
You can see that the pattern matches, and you can verify that the input parameter (text) and the regex are the same as in the Perl script, but the result is different!
So my question is: why does match.group(3) have a value of 0000 (when it should have the value LOGL), and how does that relate back to the regex and the string it is applied to?
In Perl it yields the correct result, LOGL.
Additional info: I have perused this page that highlights the differences between Perl and Java regex engines, and there doesn't appear to be anything applicable.
Upvotes: 0
Views: 186
Reputation: 73
Ok I've made it work, but I don't understand why.
The regex needs to be made non-greedy, so instead of:
^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$
it needs to be:
^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*?-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$
(with the extra ? after the * of the non-capture group)
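A plausible explanation, based only on the behaviour reported in the question: the greedy `*` first consumes `-LOGL` and then `-0000` (the first four characters of the UUID), which sets group 3 to `0000`. The next character is `0` rather than `-`, so the engine backtracks by one iteration and the overall match succeeds. Perl restores group 3 to `LOGL` when it backtracks, whereas the Java run in the question evidently kept the value from the abandoned iteration. With `*?` the engine tries the fewest iterations first, so it never enters the `-0000` iteration at all. The sketch below checks the lazy pattern against both input forms from the question; the class name `LazyGroupDemo` and the helper `describe` are illustrative and not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyGroupDemo {
    // Lazy variant from this answer: note the "*?" after the non-capturing group.
    static final Pattern LAZY = Pattern.compile(
            "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})(?:-([A-Z0-9]{4}))*?"
            + "-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$");

    // Hypothetical helper: formats the four groups, or reports "no match".
    static String describe(String id) {
        Matcher m = LAZY.matcher(id);
        if (!m.matches()) {
            return "no match";
        }
        // A group that did not participate in the match is reported as null.
        return String.format("1=%s 2=%s 3=%s 4=%s",
                m.group(1), m.group(2), m.group(3), m.group(4));
    }

    public static void main(String[] args) {
        // Form with the optional third element present.
        System.out.println(describe("A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A"));
        // Form without it: zero iterations of the lazy group suffice.
        System.out.println(describe("A-PROD-COMP-00000000-0000-8033-0000-000200354F0A"));
    }
}
```

In the second case group 3 never participates in the match, so callers should be prepared for `match.group(3)` to return `null`.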
Upvotes: 0
Reputation: 79425
Replace your regex with the following regex:
^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$
This has been moved out----------^
I have moved - out of the non-capturing group.
Demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        final String NON_SYSTEM_TYPE_REGEX = "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$";
        final Pattern NON_SYSTEM_TYPE_PATTERN = Pattern.compile(NON_SYSTEM_TYPE_REGEX);
        String uniqueIdentity = "A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A";
        final Matcher match = NON_SYSTEM_TYPE_PATTERN.matcher(uniqueIdentity);
        if (match.find()) {
            System.out.printf("Matches 1=%s 2=%s 3=%s 4=%s%n", match.group(1), match.group(2), match.group(3),
                    match.group(4));
        }
    }
}
Output:
Matches 1=PROD 2=COMP 3=LOGL 4=00000000-0000-8033-0000-000200354F0A
Check the demo at regex101 as well.
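Since the question also needs the shorter form (without the third element) to match, it may be worth checking this pattern against both inputs. The sketch below does so; the class name `HyphenOutsideCheck` and the helper `matches` are illustrative and not part of the original code. With the hyphen outside the group, zero repetitions leave two consecutive hyphens in the pattern, so the shorter form is not expected to match:

```java
import java.util.regex.Pattern;

class HyphenOutsideCheck {
    // Pattern from this answer, with the hyphen moved out of the non-capturing group.
    static final Pattern HYPHEN_OUTSIDE = Pattern.compile(
            "^[AT]-([A-Z0-9]{4})-([A-Z0-9]{4})-(?:([A-Z0-9]{4}))*"
            + "-([A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12})$");

    // Hypothetical helper: true if the whole identifier matches the pattern.
    static boolean matches(String id) {
        return HYPHEN_OUTSIDE.matcher(id).matches();
    }

    public static void main(String[] args) {
        // Third element present: prints true.
        System.out.println(matches("A-PROD-COMP-LOGL-00000000-0000-8033-0000-000200354F0A"));
        // Third element absent: prints false, because "COMP" would have to be
        // followed by two hyphens before the UUID.
        System.out.println(matches("A-PROD-COMP-00000000-0000-8033-0000-000200354F0A"));
    }
}
```

If the shorter form has to be accepted, keeping the hyphen inside the group, as in the original pattern, preserves that.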
Upvotes: 1