Arun Gowda

Reputation: 3500

Match longest string in Regex OR in case of common substring

In a regex alternation (OR), when several alternatives share a common prefix, the regex matches the first alternative listed rather than the longest one.

For example, for the regular expression regex = (KA|KARNATAKA) and input = KARNATAKA, the output is 2 matches: match1 = KA and match2 = KA.

But what I want is the longest possible match among the alternatives in the regex OR, which is match1 = KARNATAKA in my example.
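The behaviour described above can be reproduced with a few lines of Java. This is only a minimal sketch: the class name AlternationOrderDemo and the helper findAll are made up for illustration and are not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named here only for illustration.
final class AlternationOrderDemo {
    // Collects every match that Matcher.find() reports, in order.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Shorter alternative listed first: it wins at index 0 and again at index 7.
        System.out.println(findAll("(KA|KARNATAKA)", "KARNATAKA")); // [KA, KA]
        // Longer alternative listed first: the whole input is one match.
        System.out.println(findAll("(KARNATAKA|KA)", "KARNATAKA")); // [KARNATAKA]
    }
}
```

The alternatives are tried left to right at each position, so the order in which they are written decides which one is reported.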

Here is the example in a regex client

So what I am doing right now is sorting the alternatives in the regex OR by length in descending order.
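A rough sketch of that sorting workaround, assuming the alternatives are plain literal strings held in a list (the class LongestFirstAlternation and its methods are hypothetical names, not code from the question):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, assuming every alternative is a literal string.
final class LongestFirstAlternation {
    // Builds "(a|b|...)" with the longest alternative first.
    static Pattern compile(List<String> alternatives) {
        String body = alternatives.stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote) // treat each alternative as literal text
                .collect(Collectors.joining("|"));
        return Pattern.compile("(" + body + ")");
    }

    // Returns the first match found in the input, or null if there is none.
    static String firstMatch(List<String> alternatives, String input) {
        Matcher m = compile(alternatives).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        List<String> alternatives = Arrays.asList("KA", "KARNATAKA");
        System.out.println(firstMatch(alternatives, "KARNATAKA")); // KARNATAKA
    }
}
```

Pattern.quote is used here so that characters such as | or ( inside an alternative are not interpreted as regex syntax; drop it if the alternatives are meant to be sub-patterns.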

My question is: can we specify in the regex itself that it should match the longest possible string? Or is sorting the only way to do it?

I have already referred to this question, and I don't see a solution other than sorting.

Upvotes: 1

Views: 779

Answers (2)

Jai

Reputation: 8363

You can create a helper method for this:

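import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public final class PatternHelper {
    // Reorders the alternatives of a single "(a|b|...)" group so the longest comes first.
    public static Pattern compileSortedOr(String regex) {
        // group(1): text before the group, group(2): the alternatives, group(3): text after it
        Matcher matcher = Pattern.compile("(.*)\\((.*\\|.*)\\)(.*)").matcher(regex);

        if (matcher.matches()) {
            List<String> conditions = Arrays.asList(matcher.group(2).split("\\|"));
            // Sort by length, descending
            List<String> sortedConditions = conditions.stream()
                                                      .sorted((c1, c2) -> c2.length() - c1.length())
                                                      .collect(Collectors.toList());

            return Pattern.compile(matcher.group(1) +
                                       "(" +
                                       String.join("|", sortedConditions) +
                                       ")" +
                                       matcher.group(3));
        }

        // No "(a|b)" group found: compile the expression unchanged
        return Pattern.compile(regex);
    }
}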

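Matcher matcher = PatternHelper.compileSortedOr("(KA|KARNATAKA)").matcher("KARNATAKA");
// find() is used because matches() must consume the whole input and would
// therefore return KARNATAKA even without sorting
if (matcher.find()) {
    System.out.println(matcher.group(1));
}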

Output:

KARNATAKA

P.S. This only works for simple expressions without nested brackets. You would need to tweak it if you expect more complex expressions.

Upvotes: 0

The Scientific Method

Reputation: 2436

You can use a word boundary (\b) to avoid matching prefixes.

For the case you mentioned, the following regex will only match KA or KARNATAKA as whole words:

(\bKA\b|\bKARNATAKA\b)

Try here
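As a quick check of the word-boundary approach in Java, here is a small sketch (WordBoundaryDemo and findAll are illustrative names only). It assumes the alternatives appear as whole words in the input, since \b only matches between a word character and a non-word character or at the ends of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named here only for illustration.
final class WordBoundaryDemo {
    // Same pattern as above, with backslashes escaped for a Java string literal.
    static final String REGEX = "(\\bKA\\b|\\bKARNATAKA\\b)";

    // Collects every match that Matcher.find() reports, in order.
    static List<String> findAll(String input) {
        Matcher m = Pattern.compile(REGEX).matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("KARNATAKA"));    // [KARNATAKA]
        System.out.println(findAll("KA KARNATAKA")); // [KA, KARNATAKA]
        // No boundary inside a run of letters, so nothing matches here.
        System.out.println(findAll("KARNATAKAKA"));  // []
    }
}
```

The last call shows the limitation: if the alternatives can occur back to back without a separator, the boundaries prevent any match, and sorting by length is still needed.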

Upvotes: 1
