user390749

Find a single regex to get words of 3 or more characters between two specific words

I'm looking for a regex (used in Java) that gets all words of 3 or more characters between the words Peach and Apple in sentences such as the following:

Peach are nice fruits. Apple are not.

At this moment, I'm using the following parts:

\w{3,}\b

to get all words of 3 or more characters. I'm using a positive lookbehind and a positive lookahead to get the text between Peach and Apple, like this:

(?<=Peach).*(?=Apple).

I can't use two regexes, and I can't use substring or any other technique. I need one single regex to extract the words.

Upvotes: 1

Views: 191

Answers (2)

anubhava

Reputation: 784998

You can use \G for this in lookbehind:

Pattern p = Pattern.compile("(?<=(?:\\bPeach\\b|\\G)\\W).*?\\b((?!Apple\\b)\\w{3,})\\b");

String msg = "Peach a nice family of fruits. Apple are not.";
Matcher m = p.matcher( msg );

while (m.find()) {
    System.out.println( m.group(1) );
}
  • \G asserts position at the end of the previous match or the start of the string for the first match.
  • (?<=(?:\\bPeach\\b|\\G)\\W) asserts that the current position is preceded by a non-word character that itself follows either the whole word Peach or the \G position
  • (?!Apple\\b) makes sure the full word Apple does not start at the current position
  • .*?\\b((?!Apple\\b)\\w{3,})\\b lazily skips 0 or more arbitrary characters, then captures a full word of 3 or more characters in group 1.
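
To see what \G does in isolation, here is a minimal sketch that is separate from the Peach/Apple pattern. The class name LastMatchDemo and the helper leadingDigits are made up for illustration; only Pattern, Matcher and \G come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastMatchDemo {
    // Hypothetical helper: collects single digits, but only while each match
    // starts exactly where the previous match ended (or at index 0 for the
    // first match). That contiguity is what the \G anchor enforces.
    static List<String> leadingDigits(String input) {
        Matcher m = Pattern.compile("\\G\\d").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The chain of matches breaks at 'a', so 4 and 5 are never reached.
        System.out.println(leadingDigits("123a45")); // prints [1, 2, 3]
    }
}
```

The same mechanism is what lets the pattern above walk from word to word after Peach: each new match has to start right after the previous one.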

Output:

nice
family
fruits

If there are multiple Peach and Apple in the string then you can use:

String msg = "Peach, a nice family of fruits. Apple are not. Another Peach foo bar is here Apple end.";
Pattern p = Pattern.compile(
      "(?:(?<=\\bPeach\\b|\\G)\\W)(?:(?!\\bApple\\b).)*?\\b((?!Apple\\b)\\w{3,})\\b");

Matcher m = p.matcher(msg);
while (m.find()) {
    System.out.println(m.group(1));
}

Output:

nice
family
fruits
foo
bar
here


This clumsy-looking regex will probably take care of many edge cases, but it should be used only if the requirements call for handling nested or unbalanced Peach/Apple pairs:

(?:(?<=\bPeach\b(?!(?:(?!\bApple\b).)*?\bPeach\b)|\G)\W)(?:(?!\bApple\b).)*?\b((?!Apple\b)\w{3,})\b

Upvotes: 2

nhahtdh

Reputation: 56809

Instead of writing a single regex that does all the work, you can also do it in two steps:

  1. Match the substrings between the markers.
  2. For each substring, extract the words with 3 or more characters.

This approach results in simpler regexes that are less prone to bugs in edge cases.

Using the string below as example:

Peach, a nice family of fruits Apple are not. Another Peach foo bar is here Apple. Apple Peach inside Peach then Apple Peach no no Apple

I use the regex (?<=\bPeach\b).*?(?=\bApple\b) to pick out the substrings ", a nice family of fruits ", " foo bar is here ", " inside Peach then " and " no no ", then extract the words with 3 or more characters from these substrings.

The regex above is only an example. Depending on your requirement in edge cases, you can customize the regex to extract only the substrings which you want to extract words from.

You can change the regex above to (?<=\bPeach\b).*(?=\bApple\b) to get everything between the first Peach and the last Apple.
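
The difference between the lazy and the greedy version can be sketched on a shorter string. This is only an illustration: the class name GreedyVsLazyDemo and the sample input are my own, and findAll here is a trimmed-down copy of the helper from the full example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Returns every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Peach one Apple x Peach two Apple";

        // Lazy: stops at the nearest Apple, so each pair gives its own substring.
        System.out.println(findAll("(?<=\\bPeach\\b).*?(?=\\bApple\\b)", input));
        // prints [ one ,  two ]

        // Greedy: runs to the last Apple, so everything in between is one substring.
        System.out.println(findAll("(?<=\\bPeach\\b).*(?=\\bApple\\b)", input));
        // prints [ one Apple x Peach two ]
    }
}
```

Note that with the greedy version the inner Apple and Peach end up inside the substring, so they will show up as words in step 2 unless you filter them out.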

With the lazy regex (.*?), the output for the example string above is:

[nice, family, fruits, foo, bar, here, inside, Peach, then]

Depending on your need, you can change the regex as suggested above, or just simply filter the output.
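
If you go the filtering route, a minimal sketch could look like this. It assumes that "filtering" means dropping the marker words themselves from the result; the class FilterMarkersDemo and the method dropMarkers are hypothetical names, not part of the example code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FilterMarkersDemo {
    // Assumption: the marker words are exactly Peach and Apple (case-sensitive).
    private static final Set<String> MARKERS =
            new HashSet<>(Arrays.asList("Peach", "Apple"));

    // Returns a new list without the marker words, keeping the original order.
    static List<String> dropMarkers(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String w : words) {
            if (!MARKERS.contains(w)) {
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "nice", "family", "fruits", "foo", "bar", "here", "inside", "Peach", "then");
        System.out.println(dropMarkers(words));
        // prints [nice, family, fruits, foo, bar, here, inside, then]
    }
}
```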

Full example code:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;


class SO32415895 {
    public static void main(String[] args) {
        String input = "Peach, a nice family of fruits Apple are not. Another Peach foo bar is here Apple. Apple Peach inside Peach then Apple Peach no no Apple";

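        // Step 1: pick out every substring that sits between Peach and Apple
        List<String> inBetween = findAll("(?<=\\bPeach\\b).*?(?=\\bApple\\b)", input);

        // Step 2: collect the words of 3 or more characters from each substring
        List<String> words = new ArrayList<>();
        Pattern WORD_PATTERN = Pattern.compile("\\b\\w{3,}\\b");

        for (String s: inBetween) {
            words.addAll(findAll(WORD_PATTERN, s));
        }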

        System.out.println(words);
    }

    public static List<String> findAll(String pattern, String input) throws PatternSyntaxException {
        return findAll(Pattern.compile(pattern), input);
    }

    public static List<String> findAll(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        List<String> out = new ArrayList<>();

        while (m.find()) {
            out.add(m.group());
        }

        return out;
    }
}

Upvotes: 0
