jackalope

Reputation: 1564

How to match a maximum-length regex in Java

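import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern
                .compile("[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");
        Matcher matcher = pattern.matcher("i5-2450M");
        // find() locates the first match, scanning from the start of the input
        if (matcher.find()) {
            System.out.println(matcher.group(0)); // prints "i5"
        }
    }
}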

I expected this to return i5-2450M, but it actually returns i5.

Upvotes: 1

Views: 1165

Answers (2)

Eduard Seregin

Reputation: 1

Try iterating over the matches, i.e. create the matcher once and loop with while (matcher.find()).
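A minimal sketch of what that iteration looks like, assuming the pattern and input from the question (the class and method names here are made up for illustration). With this pattern the loop yields two separate pieces rather than the whole string, so iterating by itself does not produce i5-2450M:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Pattern copied from the question, split across lines for readability.
    static final Pattern PATTERN = Pattern.compile(
            "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");

    // Collects every successive match reported by find().
    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(input); // create the matcher once
        while (matcher.find()) {                  // each call resumes after the previous match
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("i5-2450M")); // prints [i5, 2450M]
    }
}
```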

Upvotes: 0

user166390

Reputation:

The problem is that the first alternative that matches is used, not the longest one.

In this case the 2nd alternative ([A-Za-z][0-9]{1,}, which matches i5) "shadows" every alternative that follows it.

// doesn't match
[0-9]{1,}[A-Za-z]{1,}|
// matches "i5"
[A-Za-z][0-9]{1,}|
// the following are never even checked, because of the previous match
[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|
[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|
[0-9][0-9\\-]{4,}|
[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+

(Note that the regular expression in the post likely has other serious issues. For instance, 0---# would be matched by the last rule. These should be addressed, but they are not covered below because they are not the fundamental problem, which is the alternation behavior.)

To fix this issue, order the alternatives so that the most specific ones come first. In this case that means moving the 2nd alternative below all the others. (Also review the other alternatives and how they interact; perhaps the entire regular expression can be simplified?)
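As a rough sketch of that reordering, assuming nothing else about the pattern changes: the 2nd alternative from the question is moved to the end, so the longer [a-zA-Z][a-zA-Z0-9\.\-_/#]{2,} alternative gets tried before it. The class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReorderedDemo {
    // Same alternatives as in the question, with the former 2nd one moved last.
    static final Pattern REORDERED = Pattern.compile(
            "[0-9]{1,}[A-Za-z]{1,}"
            + "|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*"
            + "|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+"
            + "|[A-Za-z][0-9]{1,}"); // formerly the 2nd alternative

    // Returns the first match found in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher matcher = REORDERED.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("i5-2450M")); // prints i5-2450M
        System.out.println(firstMatch("i5"));       // prints i5 (falls through to the last alternative)
    }
}
```

Short inputs such as i5 still match, because the moved alternative remains available as a fallback.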

A simple word boundary (\b) will not work here because - is considered a non-word character. However, depending upon the meaning of the regular expression, anchors (^ and $) could be placed around the alternation, e.g. ^(?:existing_regex)$. The group is needed so that the anchors apply to the whole alternation rather than only to its first and last alternatives. This doesn't change the behavior of the alternation itself, but it forces the initial match of i5 to be backtracked: end-of-input cannot be matched immediately after i5, so the subsequent alternatives are then considered.
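A small sketch of the anchored variant, assuming the whole input is supposed to be a single token. It wraps the question's unchanged pattern in ^(?:...)$; it also shows Matcher.matches(), which this answer does not mention but which likewise requires the entire input to match. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeInputDemo {
    // Pattern from the question, in its original order.
    static final String REGEX =
            "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+";

    // find() with explicit anchors around a non-capturing group.
    static String anchoredFind(String input) {
        Matcher matcher = Pattern.compile("^(?:" + REGEX + ")$").matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    // matches() succeeds only if the pattern covers the entire input.
    static boolean matchesWhole(String input) {
        return Pattern.compile(REGEX).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(anchoredFind("i5-2450M")); // prints i5-2450M
        System.out.println(matchesWhole("i5-2450M")); // prints true
    }
}
```

Neither form helps when the token is embedded in a longer string; in that case reordering the alternatives is the more general fix.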


From Java regex alternation operator "|" behavior seems broken:

Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.

(The accepted answer in this question uses word boundaries.)

From Pattern:

The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.
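The ordered alternation described in that quote can be seen with a much smaller pattern. This is only an illustrative sketch (the helper name is made up): the same two alternatives give different results depending on their order.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedAlternationDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // "a" is tried first and succeeds, so "ab" is never attempted.
        System.out.println(firstMatch("a|ab", "ab")); // prints a
        // With the longer alternative first, the longer match wins.
        System.out.println(firstMatch("ab|a", "ab")); // prints ab
    }
}
```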

Upvotes: 4
