Adrien Ball

Reputation: 499

Java Regex capture multiple groups with groups containing others

I'm trying to build a regular expression that captures multiple groups, some of which are contained in others. For instance, let's say I want to capture every 4-gram that follows a 'to' prefix:

input = "I want to run to get back on shape"
expectedOutput = ["run to get back", "get back on shape"]

In that case I would use this regex:

"to((?:[ ][a-zA-Z]+){4})"

But it only captures the first item in expectedOutput (with a leading space, but that's beside the point). This is quite easy to solve without regex, but I'd like to know whether it is possible using only a regex.

Upvotes: 1

Views: 147

Answers (2)

Wiktor Stribiżew

Reputation: 627400

You can get overlapping matches by wrapping the whole pattern in a lookahead:

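import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "I want to run to get back on shape";
// The whole pattern is a lookahead, so each match is zero-width and matches may overlap
Pattern pattern = Pattern.compile("(?=\\bto\\b((?:\\s*[\\p{L}\\p{M}]+){4}))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    // Group 1 starts with the whitespace that follows "to", so trim it
    System.out.println(matcher.group(1).trim());
}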

See IDEONE demo

The regex (?=\bto\b((?:\s*[\p{L}\p{M}]+){4})) is a zero-width lookahead, so it is tested at each position in the string without consuming any text, which is what allows the matches to overlap. At each position it looks for:

  • \bto\b - a whole word to
  • ((?:\s*[\p{L}\p{M}]+){4}) - Group 1 capturing 4 occurrences of
    • \s* zero or more whitespace(s)
    • [\p{L}\p{M}]+ - one or more letters or diacritics

If you want to allow n-grams of fewer than 4 words, use the greedy limiting quantifier {0,4} (or {1,4} to require at least one word) instead of {4}.
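For reference, here is a minimal self-contained sketch of the same lookahead idea that collects the matches into a list. The class name OverlappingNGrams and the helper nGramsAfterTo are hypothetical names chosen for illustration. It also deviates from the pattern above in one respect: it uses \s+ instead of \s*, on the assumption that words are always separated by whitespace, because with \s* backtracking can split a single word into several "words" when fewer than n words follow "to".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlappingNGrams {
    // Hypothetical helper: returns every run of exactly n words that follows a whole word "to".
    static List<String> nGramsAfterTo(String input, int n) {
        // The pattern is one big lookahead, so nothing is consumed and matches can overlap.
        // \s+ (not \s*) keeps backtracking from splitting one word into several.
        Pattern p = Pattern.compile("(?=\\bto\\b((?:\\s+[\\p{L}\\p{M}]+){" + n + "}))");
        Matcher m = p.matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            // Group 1 begins with the whitespace after "to"; trim it off.
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [run to get back, get back on shape]
        System.out.println(nGramsAfterTo("I want to run to get back on shape", 4));
    }
}
```

Because each match is empty, Matcher.find() advances by one character after every hit, so the second "to" is still examined even though it lies inside the first captured 4-gram.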

Upvotes: 1

Bahramdun Adil

Reputation: 6089

This is how groups are numbered in a regex: by the position of their opening parenthesis, counting from left to right.

1       ((A)(B(C)))   // first group (encloses the three groups below)
2       (A)           // second group
3       (B(C))        // third group (encloses the fourth group)
4       (C)           // fourth group
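The numbering can be checked with a short sketch like the one below. The class name GroupOrder and the helper groups are hypothetical and exist only for illustration. It matches the pattern ((A)(B(C))) against "ABC" and returns the text of groups 1 through 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupOrder {
    // Hypothetical helper: returns the text of groups 1..groupCount() for a full match,
    // or an empty list if the input does not match the regex.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ABC, A, BC, C]
        System.out.println(groups("((A)(B(C)))", "ABC"));
    }
}
```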

Upvotes: 0
