Andrew
Andrew

Reputation: 836

Java Regular Expression Two Question marks (??)

I know that /? means the / is optional. so "toys?" will match both toy and toys. My understanding is that if I make it lazy and use "toys??" I will match both toy and toys and always return toy. So, a quick test:

private final static Pattern TEST_PATTERN = Pattern.compile("toys??", Pattern.CASE_INSENSITIVE);
public static void main(String[] args) {
    for(String arg : args) {
        Matcher m = TEST_PATTERN.matcher(arg);
        System.out.print("Arg: " + arg);
        boolean b = false;
        while (m.find()) {
            System.out.print(" {");
            for (int i=0; i<=m.groupCount(); ++i) {
                System.out.print("[" + m.group(i) + "]");
            }
            System.out.print("}");
        }
        System.out.println();
    }
}

Yep, it looks like it works as expected

java -cp .. regextest.RegExTest toy toys
Arg: toy {[toy]}
Arg: toys {[toy]}

Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases, it returns the entire string without the s removed. Is there any functional difference between searching for "toys?2" and "toys??2".

The reason I am asking is because I found an example like the following:

private final static Pattern TEST_PATTERN = Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE);

and although I see no apparent reason for using ?? rather than ?, I thought that perhaps the original author (who is not known to me) might know something that I don't, I expect the later.

Upvotes: 10

Views: 5767

Answers (1)

nhahtdh
nhahtdh

Reputation: 56809

?? is lazy while ? is greedy.

Given (pattern)??, it will first test for empty string, then if the rest of the pattern can't match, it will test for pattern.

In contrast, (pattern)? will test for pattern first, then it will test for empty string on backtrack.


Now, change the regular expression to "toys??2" and it still matches toys2 and toy2. In both cases, it returns the entire string without the s removed. Is there any functional difference between searching for "toys?2" and "toys??2".

The difference is in the order of searching:

  • "toys?2" searches for toys2, then toy2
  • "toys??2" searches for toy2, then toys2

But for the case of these 2 patterns, the result will be the same regardless of the input string, since the sequel 2 (after s? or s??) must be matched.


As for the pattern you found:

Pattern.compile("</??tag(\\s+?.*?)??>", Pattern.CASE_INSENSITIVE)

Both ?? can be changed to ? without affecting the result:

  • / and t (in tag) are mutually exclusive. You either match one or the other.
  • > and \s are also mutually exclusive. The at least 1 in \s+? is important to this conclusion: the result might be different otherwise.

This is probably micro-optimization from the author. He probably thinks that the open tag must be there, while the closing tag might be forgotten, and that open/close tags without attributes/random spaces appears more often than those with some.

By the way, the engine might run into some expensive backtracking attempt due to \\s+?.*? when the input has <tag followed by lots of spaces without > anywhere near.

Upvotes: 25

Related Questions