Reputation: 33
The following code does not find the string "MOVE" that is present in the myStr variable:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String myStr = " ELSE MOVE EXT-LNGSHRT-AMT-C TO WK-UNSIGNED-LNGSHRT-AMT COMPUTE WK-SHORT-AMT = EXT-LNGSHRT-AMT-C * -1.";
        String verbsRegex = "\\s+(ACCEPT|ADD|ALTER|CALL|CANCEL|CLOSE|COMPUTE|DELETE|DISPLAY|DIVIDE|ELSE|EXIT|EVALUATE|EXEC|GO|GOBACK|IF|INITIALIZE|INSPECT|INVOKE|MERGE|MOVE|MULTIPLY|OPEN|PERFORM|READ|RELEASE|RETURN|REWRITE|SEARCH|SET|SORT|START|STOP|STRING|SUBTRACT|UNSTRING|WRITE|COPY|CONTINUE|WHEN)\\s+";
        Pattern p = Pattern.compile(verbsRegex);
        Matcher m = p.matcher(myStr);
        System.out.println("------------------------------------");
        while (m.find()) {
            System.out.println(myStr.substring(m.start(),m.end()));
            System.out.println("("+ m.group(1) + ")");
        }
        System.out.println("------------------------------------");
    }
}
If I change myStr to something like
String myStr = " MOVE ELSE MOVE EXT-LNGSHRT-AMT-C TO WK-UNSIGNED-LNGSHRT-AMT COMPUTE WK-SHORT-AMT = EXT-LNGSHRT-AMT-C * -1.";
Java starts returning MOVE, but in this case ELSE gets missed!
Is there any explanation for this behavior? Am I missing something obvious here?
Thanks in advance.
Upvotes: 3
Views: 120
Reputation: 124225
To print the whole match you can use m.group(0) or m.group() instead of myStr.substring(m.start(), m.end()); both are the same, since group() returns the result of group(0). Also, to see exactly what was matched, including whitespace, surround it with visible characters like [ and ] (just like you did with parentheses for group(1)).
So instead of
System.out.println(myStr.substring(m.start(),m.end()));
use
System.out.println("["+m.group()+"]");
and you will see that what you are matching is [ ELSE ] and [ COMPUTE ]. As you can see, each match also includes the whitespace after the token you searched for. Because your regex requires every match to start with at least one whitespace character, MOVE can't be matched: the space in front of it was already consumed by the ELSE match, so there is no unmatched whitespace left for it. To solve that problem you can use lookaround, which is zero-length (it checks the surrounding text without consuming it).
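To make the consumed whitespace visible, here is a minimal sketch of the bracket trick described above. The class name, the helper method, and the shortened three-verb list are assumptions made for brevity; the original pattern lists many more verbs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsumedWhitespaceDemo {
    // Shortened, illustrative verb list; same \s+(...)\s+ shape as in the question.
    static final String REGEX = "\\s+(ELSE|MOVE|COMPUTE)\\s+";

    // Returns every whole match wrapped in [ ] so that matched whitespace is visible.
    static List<String> bracketedMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX).matcher(input);
        while (m.find()) {
            out.add("[" + m.group() + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        String myStr = " ELSE MOVE EXT-LNGSHRT-AMT-C TO WK-UNSIGNED-LNGSHRT-AMT COMPUTE WK-SHORT-AMT = EXT-LNGSHRT-AMT-C * -1.";
        // The space between ELSE and MOVE belongs to the ELSE match,
        // so MOVE has no leading whitespace left to match.
        System.out.println(bracketedMatches(myStr)); // [[ ELSE ], [ COMPUTE ]]
    }
}
```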
So instead of \\s+(...)\\s+
you can rewrite it as
(?<=\\s)(...)(?=\\s)
The problem with this version is that your token still needs to be surrounded by whitespace, so you will not be able to find matches placed at the very start or end of the string.
One solution could be \b, which is a word boundary. It matches at the start or end of the string (when a word character is there), or between a word character [a-zA-Z0-9_] and any non-[a-zA-Z0-9_] character. But that also includes the position between a letter and -, so if you have IF-ELSE it would find IF and ELSE separately, even if you want IF-ELSE to be treated as a single token that doesn't match any of the alternatives listed in (...).
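The hyphen caveat can be checked with a small sketch. The class and method names and the three-verb list are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryCaveatDemo {
    // Illustrative, shortened verb list wrapped in word boundaries.
    static final String REGEX = "\\b(IF|ELSE|MOVE)\\b";

    // Collects the captured verb (group 1) of every match.
    static List<String> verbs(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // '-' is not a word character, so \b matches on both sides of it
        // and IF-ELSE is reported as two separate verbs.
        System.out.println(verbs("IF-ELSE MOVE")); // [IF, ELSE, MOVE]
    }
}
```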
Another solution is to accept, besides whitespace, the start and end of the string, which are represented by ^ and $ (more info at: http://www.regular-expressions.info/anchors.html). In that case your regex could look like
(?<=\\s|^)(...)(?=\\s|$)
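As a rough check of this lookaround form, the sketch below runs it against both strings from the question, plus a hyphenated token. The class name, helper method, and shortened verb list are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundAnchorsDemo {
    // Illustrative, shortened verb list. The lookbehind and lookahead only
    // check for whitespace or a string edge; they do not consume anything.
    static final String REGEX = "(?<=\\s|^)(IF|ELSE|MOVE|COMPUTE)(?=\\s|$)";

    // Collects the captured verb (group 1) of every match.
    static List<String> verbs(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String first = " ELSE MOVE EXT-LNGSHRT-AMT-C TO WK-UNSIGNED-LNGSHRT-AMT COMPUTE WK-SHORT-AMT = EXT-LNGSHRT-AMT-C * -1.";
        String second = " MOVE ELSE MOVE EXT-LNGSHRT-AMT-C TO WK-UNSIGNED-LNGSHRT-AMT COMPUTE WK-SHORT-AMT = EXT-LNGSHRT-AMT-C * -1.";
        System.out.println(verbs(first));     // [ELSE, MOVE, COMPUTE]
        System.out.println(verbs(second));    // [MOVE, ELSE, MOVE, COMPUTE]
        // A verb at the very start or end of the string is found as well,
        // while a hyphenated token such as IF-ELSE is not split.
        System.out.println(verbs("MOVE A TO B ELSE")); // [MOVE, ELSE]
        System.out.println(verbs("IF-ELSE"));          // []
    }
}
```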
BTW, we usually try to avoid alternations written like (A|AB). Alternatives are tried from left to right, so if A is enough for the entire regex to match (depending on what the rest of the regex looks like), AB will never be tried. So if you have a regex like (A|AB), then for the string AAB you will find two matches, A and A, not A and AB. That is why we usually try to order alternatives from most specific to least specific, like (AB|A) (or, in the case of literals, you can order them by length, longest first).
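The effect of alternation order is easy to reproduce. The following sketch (class and method names are hypothetical) runs both orderings against the string AAB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    // Collects every whole match of the given regex in the input.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // A is tried first and succeeds, so AB is never attempted.
        System.out.println(matches("(A|AB)", "AAB")); // [A, A]
        // AB is tried first: it fails at index 0 (text is "AA") and succeeds at index 1.
        System.out.println(matches("(AB|A)", "AAB")); // [A, AB]
    }
}
```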
Upvotes: 1
Reputation: 887
The \s+ at the end of the pattern clashes with the \s+ at the beginning. Both are greedy, which means the trailing \s+ of the ELSE match consumes all whitespace up to the word MOVE, leaving no whitespace to the left of it, which means MOVE doesn't match.
Change both \s+ to \s+? and MOVE matches, provided more than one whitespace character separates the two verbs; with a single space, even the lazy \s+? has to consume it. Also be aware that this still requires every match to have its own one-or-more whitespace characters on each side. A word boundary or lookaround can solve this.
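A small sketch can show when the lazy quantifier helps. The class name, the shortened verb list, and the sample strings (one with two spaces between ELSE and MOVE, one with a single space) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyWhitespaceDemo {
    // Illustrative, shortened verb list in greedy and lazy variants.
    static final String GREEDY = "\\s+(ELSE|MOVE|COMPUTE)\\s+";
    static final String LAZY = "\\s+?(ELSE|MOVE|COMPUTE)\\s+?";

    // Collects the captured verb (group 1) of every match.
    static List<String> verbs(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String twoSpaces = " ELSE  MOVE A TO B"; // two spaces between the verbs
        String oneSpace = " ELSE MOVE A TO B";   // a single space between the verbs

        // Greedy: the trailing \s+ of the ELSE match takes both spaces.
        System.out.println(verbs(GREEDY, twoSpaces)); // [ELSE]
        // Lazy: the ELSE match takes only one space, leaving one for MOVE.
        System.out.println(verbs(LAZY, twoSpaces));   // [ELSE, MOVE]
        // With a single space there is nothing left over, lazy or not.
        System.out.println(verbs(LAZY, oneSpace));    // [ELSE]
    }
}
```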
Upvotes: 3
Reputation: 59978
Instead of using \s+ you can use \b (word boundaries) to match any word in the group, so your regex should look like this:
\\b(ACCEPT|...|WHEN)\\b
Output:
------------------------------------
ELSE
(ELSE)
MOVE
(MOVE)
COMPUTE
(COMPUTE)
------------------------------------
Upvotes: 2