Chaitanya R

Reputation: 33

Strange behaviour in Java regex

The following code does not find the string "MOVE", which is present in the myStr variable:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
       String myStr = "    ELSE  MOVE   EXT-LNGSHRT-AMT-C TO WK-UNSIGNED-LNGSHRT-AMT  COMPUTE WK-SHORT-AMT = EXT-LNGSHRT-AMT-C * -1.";
       String verbsRegex = "\\s+(ACCEPT|ADD|ALTER|CALL|CANCEL|CLOSE|COMPUTE|DELETE|DISPLAY|DIVIDE|ELSE|EXIT|EVALUATE|EXEC|GO|GOBACK|IF|INITIALIZE|INSPECT|INVOKE|MERGE|MOVE|MULTIPLY|OPEN|PERFORM|READ|RELEASE|RETURN|REWRITE|SEARCH|SET|SORT|START|STOP|STRING|SUBTRACT|UNSTRING|WRITE|COPY|CONTINUE|WHEN)\\s+";

       Pattern p = Pattern.compile(verbsRegex);
       Matcher m = p.matcher(myStr);
       System.out.println("------------------------------------");
       while (m.find()) {
           System.out.println(myStr.substring(m.start(),m.end()));
           System.out.println("("+ m.group(1) + ")");
       }
       System.out.println("------------------------------------");
    }
}

If I change myStr to something like

       String myStr = "   MOVE  ELSE  MOVE   EXT-LNGSHRT-AMT-C TO WK-UNSIGNED-LNGSHRT-AMT  COMPUTE WK-SHORT-AMT = EXT-LNGSHRT-AMT-C * -1.";

Java starts returning MOVE, but in this case ELSE gets missed!

Any explanation for this behavior please? Am I missing something obvious here?

Thanks in advance.

Upvotes: 3

Views: 120

Answers (3)

Pshemo

Reputation: 124225

To print the whole match, instead of myStr.substring(m.start(), m.end()) you can use m.group(0) or m.group() (they are equivalent, since group() returns the result of group(0)). To make the full extent of the match visible, including its whitespace, surround it with characters like [ ] (just like you did for group(1)).

So instead of

System.out.println(myStr.substring(m.start(),m.end()));

use

System.out.println("["+m.group()+"]");

and you will see that what you are matching is [ ELSE ] and [ COMPUTE ], each including the whitespace around it. As you can see, every match also consumes all the whitespace after the token you searched for. Since your regex requires a match to start with at least one whitespace character, MOVE can't be matched: the ELSE match has already consumed the whitespace in front of it, so there is no unmatched whitespace left. To solve that problem you can use the lookaround mechanism, which is zero-length (it checks the surrounding text without consuming it).
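To make the consumed whitespace visible, here is a minimal sketch. The class name ConsumedWhitespaceDemo, the helper bracketedMatches, the shortened alternation (ELSE|MOVE) and the short input string are my own, used only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumedWhitespaceDemo {
    // Hypothetical helper: returns every full match (group 0) wrapped in
    // brackets so that leading and trailing whitespace becomes visible.
    static List<String> bracketedMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("[" + m.group() + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        // Two spaces before ELSE, two between ELSE and MOVE.
        String input = "  ELSE  MOVE  X";
        // The only match is ELSE together with both spaces on each side,
        // so no whitespace is left in front of MOVE.
        System.out.println(bracketedMatches("\\s+(ELSE|MOVE)\\s+", input));
    }
}
```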

So instead of \\s+(...)\\s+ you can rewrite it as

(?<=\\s)(...)(?=\\s)

The problem with this is that your token still needs whitespace on both sides, so you will not find matches placed at the very start or end of the string.
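A small sketch of that behaviour, again with a shortened alternation and made-up inputs (LookaroundDemo and keywords are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: collects group(1) of every match.
    static List<String> keywords(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "(?<=\\s)(ELSE|MOVE)(?=\\s)";
        // Whitespace is only checked, not consumed: both keywords are found.
        System.out.println(keywords(regex, "  ELSE  MOVE  X"));
        // Keywords at the very start or end have no whitespace on one side,
        // so nothing is found here.
        System.out.println(keywords(regex, "MOVE X ELSE"));
    }
}
```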

One solution could be \b, the word boundary. It matches at the start or end of the string (when the adjacent character is a word character), or between a [a-zA-Z0-9_] character and a non-[a-zA-Z0-9_] character. But that also includes the positions between a letter and -, so for IF-ELSE it would find IF and ELSE separately, even if you want IF-ELSE to be treated as a single token that matches none of the alternatives in the (...) part.
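The hyphen caveat can be seen in a short sketch (WordBoundaryDemo, keywords and the input "IF-ELSE MOVE" are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: collects group(1) of every match.
    static List<String> keywords(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // '-' is not a word character, so \b matches on both sides of it:
        // IF and ELSE are reported separately, followed by MOVE.
        System.out.println(keywords("\\b(IF|ELSE|MOVE)\\b", "IF-ELSE MOVE"));
    }
}
```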

Another solution is to accept, besides whitespace, the start and end of the string, which are represented by ^ and $ (more info at: http://www.regular-expressions.info/anchors.html). In that case your regex could look like

(?<=\\s|^)(...)(?=\\s|$)
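As a rough check of this variant, the sketch below uses a shortened alternation and invented inputs (AnchoredLookaroundDemo and keywords are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredLookaroundDemo {
    // Hypothetical helper: collects group(1) of every match.
    static List<String> keywords(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "(?<=\\s|^)(IF|ELSE|MOVE)(?=\\s|$)";
        // Keywords at the start and end of the string are now found.
        System.out.println(keywords(regex, "MOVE X ELSE"));
        // IF-ELSE is not split, because '-' is neither whitespace
        // nor a string edge; only MOVE is reported.
        System.out.println(keywords(regex, "IF-ELSE MOVE"));
    }
}
```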

BTW, we usually try to avoid writing (A|AB), because if A is enough for the entire regex to match (depending on what the rest of the regex looks like), AB will never be tried. So with a regex like (A|AB), for the string AAB you will find two matches, A and A, not A and AB. That is why we usually order alternatives from most specific to least specific, like (AB|A) (for literals, you can order them by length, longest first).
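The ordering effect is easy to reproduce. In this sketch the class name AlternationOrderDemo and the helper matches are my own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Hypothetical helper: collects the full text of every match.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // A is tried first and always succeeds, so AB is never reached: A, A
        System.out.println(matches("(A|AB)", "AAB"));
        // The longer alternative is tried first: A, AB
        System.out.println(matches("(AB|A)", "AAB"));
    }
}
```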

Upvotes: 1

linden2015

Reputation: 887

The \s+ at the end clashes with the \s+ at the beginning of the pattern. Both are greedy, so the trailing \s+ of the ELSE match consumes all the whitespace up to the word MOVE, leaving no whitespace to the left of it, which means MOVE doesn't match.

Change both \s+ to \s+? and MOVE matches. But be aware that each keyword still needs its own whitespace characters, at least one on each side, so two keywords separated by a single space will not both match. A word boundary or lookaround can solve this.
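A quick sketch of the lazy version and its limitation, using a shortened alternation and invented inputs (LazyQuantifierDemo and keywords are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyQuantifierDemo {
    // Hypothetical helper: collects group(1) of every match.
    static List<String> keywords(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "\\s+?(ELSE|MOVE)\\s+?";
        // Two spaces between the keywords: the trailing lazy \s+? takes only
        // one of them, leaving one for MOVE, so both keywords are found.
        System.out.println(keywords(regex, "  ELSE  MOVE  X"));
        // A single space between the keywords is used up by the ELSE match,
        // so MOVE is missed again.
        System.out.println(keywords(regex, "  ELSE MOVE  X"));
    }
}
```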

Upvotes: 3

Youcef LAIDANI

Reputation: 59978

Instead of \s+ you can use \b (word boundaries) to match any word in the group, so your regex should look like this:

\\b(ACCEPT|...|WHEN)\\b

Outputs

------------------------------------
ELSE
(ELSE)
MOVE
(MOVE)
COMPUTE
(COMPUTE)
------------------------------------

Upvotes: 2
