Reputation: 499
I'm trying to build a regular expression that captures multiple groups, some of which overlap with others. For instance, let's say I want to capture every 4-gram that follows the word 'to':
input = "I want to run to get back on shape"
expectedOutput = ["run to get back", "get back on shape"]
In that case I would use this regex:
"to((?:[ ][a-zA-Z]+){4})"
But it only captures the first item in expectedOutput
(with a leading space, but that's not the point).
This is quite easy to solve without regex, but I'd like to know if it is possible only using regex.
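To make the problem concrete, here is a minimal sketch that reproduces it. Java is an assumption here (the question does not name a language), and the class and method names are hypothetical. Because each match consumes the text it covers, the search resumes after "back" and never sees the second "to".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumingMatchDemo {
    // The regex from the question, unchanged.
    static final Pattern CONSUMING = Pattern.compile("to((?:[ ][a-zA-Z]+){4})");

    // Hypothetical helper: collects group 1 of every match found by find().
    static List<String> captures(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = CONSUMING.matcher(s);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Only one capture: the first match swallows the second "to".
        System.out.println(captures("I want to run to get back on shape"));
    }
}
```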
Upvotes: 1
Views: 147
Reputation: 627400
You can use a lookahead with a capturing group to get overlapping matches:
String s = "I want to run to get back on shape";
Pattern pattern = Pattern.compile("(?=\\bto\\b((?:\\s*[\\p{L}\\p{M}]+){4}))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(1).trim());
}
See IDEONE demo
The regex (?=\bto\b((?:\s*[\p{L}\p{M}]+){4}))
checks each location in the string (since it is a zero width assertion) and looks for:
\bto\b
- a whole word to
((?:\s*[\p{L}\p{M}]+){4})
- Group 1 capturing 4 occurrences of
\s*
- zero or more whitespace characters
[\p{L}\p{M}]+
- one or more letters or diacritics
If you also want to capture n-grams of fewer than 4 words, use {0,4} (or {1,4} to require at least one word) as a greedy limiting quantifier instead of {4}.
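For reference, a self-contained version of the snippet above that collects the captures into a list. The class and helper names are hypothetical; the pattern is the one from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingNgrams {
    // The lookahead consumes no input, so find() retries at every position
    // and the captured groups are allowed to overlap.
    static final Pattern AFTER_TO =
            Pattern.compile("(?=\\bto\\b((?:\\s*[\\p{L}\\p{M}]+){4}))");

    // Hypothetical helper: returns group 1 of every lookahead match, trimmed.
    static List<String> ngramsAfterTo(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = AFTER_TO.matcher(s);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [run to get back, get back on shape]
        System.out.println(ngramsAfterTo("I want to run to get back on shape"));
    }
}
```

One caveat: because \s* may match nothing, backtracking can split a single word into several pieces when fewer than four words follow (for example, "to abcd" yields "abcd"). If that matters for your input, replacing \s* with \s+ is one way to require a separator between words.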
Upvotes: 1
Reputation: 6089
Capture groups are numbered by the position of their opening parenthesis, counting from left to right:
1 ((A)(B(C))) // first group (directly contains two other groups)
2 (A) // second group
3 (B(C)) // third group (contains one other group)
4 (C) // fourth group
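The numbering above can be checked with a short sketch. Java is used here only to match the other answer, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOrderDemo {
    // Hypothetical helper: matches "ABC" against ((A)(B(C))) and returns
    // groups 1..4 in order (index 0 of the array holds group 1).
    static String[] groups() {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        if (!m.matches()) {
            throw new IllegalStateException("pattern should match ABC");
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups();
        for (int i = 0; i < g.length; i++) {
            // prints: 1 ABC, 2 A, 3 BC, 4 C (one per line)
            System.out.println((i + 1) + " " + g[i]);
        }
    }
}
```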
Upvotes: 0