Jules Sam. Randolph

Reputation: 4220

Regex matching with Java Matcher won't find as expected

Here is the regex I'm having issues with: ^(?:(\S+?)(?:\s+|\s*$)). I'm trying to match the 3 occurrences of this pattern in the following String: -execution thisIsTest1 thisIsTest2. The method below grabs the first numberOfArgs elements and returns a List<String> filled with the matched items. The problem is that the size of the returned List is always 1: the loop iterates once and then exits.

private final String arguments="-execution  thisIsTest1  thisIsTest2";
/**
 * Split the first N arguments separated with one or more whitespaces.
 * @return the list of size numberOfArgs containing the matched elements.
 */
...
public List<String> fragmentFirstN(int numberOfArgs){
    Pattern patt = Pattern.compile("^(?:(\\S+?)(?:\\s+|\\s*$))",Pattern.MULTILINE);
    Matcher matc = patt.matcher(arguments);
    ArrayList<String> args = new ArrayList<>();
    logg.info(arguments);
    int i = 0;
    while(matc.find()&&i<numberOfArgs){
        args.add(matc.group(1));
        i++;
    }
    return args;
}

And here is the test class:

private String[] argArr={"-execution",
        "thisIsTest1",
        "thisIsTest2"
};
...
@Test
public void testFragmentFirstN() throws Exception {
    List<String> arr = test.fragmentFirstN(3);
    assertNotNull(arr);
    System.out.println(arr); // prints: [-execution]
    System.out.println(test.getArguments()); // prints: -execution  thisIsTest1  thisIsTest2
    assertEquals(argArr[0],arr.get(0));
    assertEquals(argArr[1],arr.get(1)); // throws IndexOutOfBoundsException: Index: 1, Size: 1
    assertEquals(argArr[2],arr.get(2));
    assertEquals(3,arr.size());
}

I thought Matcher#find() would match every possible character sequence when looped over. What am I missing?

Upvotes: 1

Views: 575

Answers (1)

M A

Reputation: 72844

The problem is that the regex starts with the boundary matcher ^, which matches only at the start of the input string (or, with Pattern.MULTILINE, at the start of a line). The first time Matcher.find() is invoked in the loop, the matched substring is -execution together with the whitespace that follows it. This works because -execution starts at the beginning of the string, and the part (?:\\s+|\\s*$) matches either one or more whitespace characters (which is the case after -execution) or optional whitespace followed by the end of a line or of the input.

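The second call to find() resumes where the first match ended, which is in the middle of the string. Because the input contains no line terminators, ^ cannot match there or at any later position. Hence Matcher.find() returns false and the loop exits.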
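To see the anchor's effect in isolation, here is a minimal sketch. The class AnchorDemo and the helper findAll are hypothetical names introduced only for illustration; the regex and the single-line input are the ones from the question, and the newline-separated input is an assumed variation added to show what Pattern.MULTILINE changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorDemo {
    // The regex from the question, still anchored with ^.
    static final String ANCHORED = "^(?:(\\S+?)(?:\\s+|\\s*$))";

    // Hypothetical helper: collects group(1) of every successful find().
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Single line: ^ can only match at index 0, so find() succeeds once.
        System.out.println(findAll(ANCHORED, "-execution  thisIsTest1  thisIsTest2"));
        // prints [-execution]

        // Assumed variation: one token per line. With MULTILINE, ^ also
        // matches after each line terminator, so every token is found.
        System.out.println(findAll(ANCHORED, "-execution\nthisIsTest1\nthisIsTest2"));
        // prints [-execution, thisIsTest1, thisIsTest2]
    }
}
```

The second call shows that Pattern.MULTILINE is not what breaks the loop; it only lets ^ match at line starts, and the input in the question has a single line.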

You can try removing the ^ character:

Pattern patt = Pattern.compile("(?:(\\S+?)(?:\\s+|\\s*$))",
            Pattern.MULTILINE);

EDIT:

Based on @ajb's comments, simply removing the ^ character would make the regex also match an input string that starts with whitespace, because find() skips ahead to the first non-whitespace character. In case this is not desired, you can instead replace ^ with \G, which matches at the end of the previous match (or at the start of the input on the first call to find()):

Pattern patt = Pattern.compile("\\G(?:(\\S+?)(?:\\s+|\\s*$))",
            Pattern.MULTILINE);
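The difference between the two fixes can be checked with a small sketch. ContiguousMatchDemo, firstN, and the padded input with two leading spaces are assumptions made for illustration; the two patterns are the ones shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContiguousMatchDemo {
    static final String UNANCHORED = "(?:(\\S+?)(?:\\s+|\\s*$))";
    static final String CONTIGUOUS = "\\G(?:(\\S+?)(?:\\s+|\\s*$))";

    // Hypothetical helper modeled on fragmentFirstN: returns at most
    // limit tokens captured by group(1).
    static List<String> firstN(String regex, String input, int limit) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(input);
        List<String> out = new ArrayList<>();
        // Check the limit first so find() is not called once more than needed.
        while (out.size() < limit && m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String clean = "-execution  thisIsTest1  thisIsTest2";
        String padded = "  " + clean; // assumed input with leading whitespace

        // Both patterns return all three tokens for the original input.
        System.out.println(firstN(UNANCHORED, clean, 3));
        System.out.println(firstN(CONTIGUOUS, clean, 3));

        // Without an anchor, find() skips the leading whitespace.
        System.out.println(firstN(UNANCHORED, padded, 3));
        // prints [-execution, thisIsTest1, thisIsTest2]

        // With \G, the first match must begin at index 0, so nothing matches.
        System.out.println(firstN(CONTIGUOUS, padded, 3));
        // prints []
    }
}
```

Which behavior is preferable depends on whether leading whitespace should be tolerated or treated as malformed input.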

Upvotes: 2
