Reputation:
I'm looking for a regex (used in Java) that matches every word of 3 or more characters found between the words Peach and Apple, in sentences such as the following:
Peach are nice fruits. Apple are not.
At the moment, I'm using the following parts:
\w{3,}\b
to match all the words of 3 or more characters, and a positive lookbehind with a positive lookahead to get the text between Peach and Apple, like this:
(?<=Peach).*(?=Apple)
I can't use two regexes, and I can't use substring or any other technique. I need one single regex to extract the words.
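For illustration, here is a minimal sketch of what the lookaround pattern returns on its own. The class name LookaroundOnly and the helper between are made up for this example. On the sample sentence it yields the whole span between the two words as one match, not the individual words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are only for illustration.
class LookaroundOnly {
    // Collects every match of the lookbehind/lookahead pattern.
    static List<String> between(String input) {
        Matcher m = Pattern.compile("(?<=Peach).*(?=Apple)").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // A single match: the text between Peach and Apple, spaces and punctuation included.
        System.out.println(between("Peach are nice fruits. Apple are not."));
    }
}
```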
Upvotes: 1
Views: 191
Reputation: 784998
You can use \G in a lookbehind for this:
Pattern p = Pattern.compile("(?<=(?:\\bPeach\\b|\\G)\\W).*?\\b((?!Apple\\b)\\w{3,})\\b");
String msg = "Peach a nice family of fruits. Apple are not.";
Matcher m = p.matcher( msg );
while (m.find()) {
    System.out.println(m.group(1));
}
\G asserts the position at the end of the previous match, or at the start of the string for the first match.
(?<=(?:\\bPeach\\b|\\G)\\W) asserts that the current position is preceded by either the literal word Peach or \G, followed by one non-word character.
(?!Apple\\b) makes sure the full word Apple is not ahead of the current position.
\\b\\w{3,}\\b matches a full word with 3 or more characters, after 0 or more arbitrary characters.
Output:
nice
family
fruits
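To see the effect of \G in isolation, here is a small sketch on a toy input (the class name GAnchorDemo and the helper leadingDigits are made up for this example). Each match has to start exactly where the previous one ended, so the chain of matches breaks at the first character that does not fit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are only for illustration.
class GAnchorDemo {
    // Matches one digit at a time, each anchored to the end of the previous match.
    static List<String> leadingDigits(String input) {
        Matcher m = Pattern.compile("\\G\\d").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [1, 2, 3]: the chain stops at 'a', so 456 is never reached.
        System.out.println(leadingDigits("123abc456"));
    }
}
```

Without \G the same loop would also return 4, 5 and 6, because find() is otherwise free to skip ahead to the next digit.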
If the string contains multiple Peach/Apple pairs, you can use:
String msg = "Peach, a nice family of fruits. Apple are not. Another Peach foo bar is here Apple end.";
Pattern p = Pattern.compile(
    "(?:(?<=\\bPeach\\b|\\G)\\W)(?:(?!\\bApple\\b).)*?\\b((?!Apple\\b)\\w{3,})\\b");
Matcher m = p.matcher(msg);
while (m.find()) {
    System.out.println(m.group(1));
}
Output:
nice
family
fruits
foo
bar
here
This clumsy-looking regex will probably take care of many edge cases, but it should be used only if the requirements include nested or unbalanced Peach/Apple pairs:
(?:(?<=\bPeach\b(?!(?:(?!\bApple\b).)*?\bPeach\b)|\G)\W)(?:(?!\bApple\b).)*?\b((?!Apple\b)\w{3,})\b
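As a rough sketch of how this last regex could be used from Java, the snippet below escapes it as a string literal and runs it on a nested input. The class name NestedPairs and the helper wordsBetween are made up for this example. The extra negative lookahead inside the lookbehind rejects a Peach that is followed by another Peach before the next Apple, so only the innermost Peach starts a match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are only for illustration.
class NestedPairs {
    // The regex above, with every backslash doubled for a Java string literal.
    private static final Pattern WORDS = Pattern.compile(
        "(?:(?<=\\bPeach\\b(?!(?:(?!\\bApple\\b).)*?\\bPeach\\b)|\\G)\\W)"
        + "(?:(?!\\bApple\\b).)*?\\b((?!Apple\\b)\\w{3,})\\b");

    // Collects group 1 (the word itself) of every match.
    static List<String> wordsBetween(String input) {
        Matcher m = WORDS.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // The first Peach is skipped because another Peach follows before Apple.
        System.out.println(wordsBetween("Peach inside Peach then Apple"));
    }
}
```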
Upvotes: 2
Reputation: 56809
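Instead of writing a single regex that does all the work, you can also do it in two steps: first pick out the substrings between Peach and Apple, then extract the words of 3 or more characters from those substrings.
This approach results in simpler regexes and is less prone to bugs in edge cases.
Using the string below as an example: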
Peach, a nice family of fruits Apple are not. Another Peach foo bar is here Apple. Apple Peach inside Peach then Apple Peach no no Apple
I use the regex (?<=\bPeach\b).*?(?=\bApple\b)
to pick out the substrings
", a nice family of fruits "
" foo bar is here "
" inside Peach then "
" no no "
and then extract the words with 3 or more characters from these substrings.
The regex above is only an example. Depending on how you need edge cases handled, you can customize it so that it extracts only the substrings you want to pull words from.
You can change the regex above to (?<=\bPeach\b).*(?=\bApple\b)
to get everything between the first Peach and the last Apple.
The output for the example above is:
[nice, family, fruits, foo, bar, here, inside, Peach, then]
Depending on your need, you can change the regex as suggested above, or just simply filter the output.
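The difference between the lazy and the greedy variant can be seen on a shorter, made-up input. This is only a sketch, and the class name LazyVsGreedy and its helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are only for illustration.
class LazyVsGreedy {
    // Lazy quantifier: stops at the nearest Apple after each Peach.
    static List<String> lazy(String input) {
        return findAll(Pattern.compile("(?<=\\bPeach\\b).*?(?=\\bApple\\b)"), input);
    }

    // Greedy quantifier: runs from the first Peach to the last Apple.
    static List<String> greedy(String input) {
        return findAll(Pattern.compile("(?<=\\bPeach\\b).*(?=\\bApple\\b)"), input);
    }

    private static List<String> findAll(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Peach one Apple two Peach three Apple";
        System.out.println(lazy(input));   // two substrings: " one " and " three "
        System.out.println(greedy(input)); // one substring: " one Apple two Peach three "
    }
}
```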
Full example code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
class SO32415895 {
    public static void main(String[] args) {
        String input = "Peach, a nice family of fruits Apple are not. Another Peach foo bar is here Apple. Apple Peach inside Peach then Apple Peach no no Apple";
        List<String> inBetween = findAll("(?<=\\bPeach\\b).*?(?=\\bApple\\b)", input);
        List<String> words = new ArrayList<>();
        Pattern WORD_PATTERN = Pattern.compile("\\b\\w{3,}\\b");
        for (String s: inBetween) {
            words.addAll(findAll(WORD_PATTERN, s));
        }
        System.out.println(words);
    }

    public static List<String> findAll(String pattern, String input) throws PatternSyntaxException {
        return findAll(Pattern.compile(pattern), input);
    }

    public static List<String> findAll(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }
}
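If you go with filtering the output instead of changing the regex, one possible sketch is to drop the delimiter words from the resulting list. The class name FilterDelimiters and the helper dropDelimiters are made up for this example, and whether Peach should be dropped at all depends on your requirements.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are only for illustration.
class FilterDelimiters {
    // Returns a copy of the list without the delimiter words themselves.
    static List<String> dropDelimiters(List<String> words) {
        List<String> out = new ArrayList<>(words);
        out.removeIf(w -> w.equals("Peach") || w.equals("Apple"));
        return out;
    }

    public static void main(String[] args) {
        List<String> words = List.of("nice", "family", "fruits", "foo", "bar",
                                     "here", "inside", "Peach", "then");
        // Prints [nice, family, fruits, foo, bar, here, inside, then]
        System.out.println(dropDelimiters(words));
    }
}
```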
Upvotes: 0