Kajal

Reputation: 739

Multiple group match in regex

I have an input string

invalidsufix\nsubadatax\nsufixpart\nsubdata1\nsomerandomn\nsubdata2\nsubdatan\nend

I want to fetch only the subdata parts of it. I tried:

Pattern p = Pattern.compile("(?<=sufixpart).*?(subdata.)+.*?(?=end)", Pattern.DOTALL);

Matcher m = p.matcher(inputString);
while (m.find()) {
    System.out.println(m.group(1));
}

But I get only the first match. How can I get all the subdata, such as [subdata1, subdata2, subdatan]?

Upvotes: 3

Views: 689

Answers (1)

Wiktor Stribiżew

Reputation: 627536

I'd go for a simpler approach: get the blocks first with a regex like start(.*?)end, and then extract all the matches from Group 1 with a plain subdata\S*-like regex.

See the Java demo:

// Needs java.util.ArrayList, java.util.List, java.util.regex.Matcher, java.util.regex.Pattern
String rx = "(?sm)^sufixpart$(.*?)^end$";
String s = "invalidsufix\nsubadatax\nsufixpart\nsubdata1\nsomerandomn\nsubdata2\nsubdatan\nend\ninvalidsufix\nsubadatax\nsufixpart\nsubdata001\nsomerandomn\nsubdata002\nsubdata00n\nend";
Pattern pattern_outer = Pattern.compile(rx);
Pattern pattern_token = Pattern.compile("(?m)^subdata\\S*$");
Matcher matcher = pattern_outer.matcher(s);
List<List<String>> res = new ArrayList<>();
while (matcher.find()){
    List<String> lst = new ArrayList<>();
    if (!matcher.group(1).isEmpty()) {                       // If Group 1 is not empty
        Matcher m = pattern_token.matcher(matcher.group(1)); // Init the second matcher
        while (m.find()) {                       // If a token is found
            lst.add(m.group(0));                 //    add it to the list
        }
    }
    res.add(lst);                                // Add the list to the result list
} 
System.out.println(res); // => [[subdata1, subdata2, subdatan], [subdata001, subdata002, subdata00n]]
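
If you are on Java 9 or later, the same two-step idea can probably be written more compactly with Matcher.results(). This is only a sketch of that variant: the class name BlockSubdata and the method name extract are made up for illustration, and the two patterns are the ones from the demo above.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original demo. Requires Java 9+.
public class BlockSubdata {
    // Outer pattern: one sufixpart ... end block, body captured in Group 1
    private static final Pattern BLOCK = Pattern.compile("(?sm)^sufixpart$(.*?)^end$");
    // Inner pattern: a whole line that starts with subdata
    private static final Pattern TOKEN = Pattern.compile("(?m)^subdata\\S*$");

    static List<List<String>> extract(String input) {
        return BLOCK.matcher(input).results()                    // one MatchResult per block
            .map(block -> TOKEN.matcher(block.group(1)).results() // tokens inside the block body
                .map(MatchResult::group)
                .collect(Collectors.toList()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "invalidsufix\nsubadatax\nsufixpart\nsubdata1\nsomerandomn\nsubdata2\nsubdatan\nend";
        System.out.println(extract(s)); // [[subdata1, subdata2, subdatan]]
    }
}
```

Unlike the loop above, this version has no explicit isEmpty() check: an empty block body simply yields an empty inner list.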

Another approach is to use a \G based regex:

(?sm)(?:\G(?!\A)|^sufixpart$)(?:(?!^(?:sufixpart|end)$).)*?(subdata\S*)(?=.*?^end$)

See the regex demo

Explanation:

  • (?sm) - enables DOTALL and MULTILINE modes
  • (?:\G(?!\A)|^sufixpart$) - matches either the end of the previous successful match (\G(?!\A)) or a whole line with sufixpart text on it (^sufixpart$)
  • (?:(?!^(?:sufixpart|end)$).)*? - matches, as few times as possible, any char that does not start a whole-line sufixpart or end (a tempered greedy token), so a match cannot run past the end line or into the next block
  • (subdata\S*) - Group 1 matching subdata and 0+ non-whitespaces
  • (?=.*?^end$) - there must be an end line somewhere after the match (after any 0+ chars).
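
To see the pattern exactly as explained above (non-capturing first group, so the subdata token is Group 1) in action, here is a minimal sketch that collects every match into one flat list. The class name FlatSubdata and the method name extract are hypothetical, and the input is the string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class FlatSubdata {
    // The \G-based pattern from the explanation; the subdata token is Group 1
    private static final Pattern SUBDATA = Pattern.compile(
        "(?sm)(?:\\G(?!\\A)|^sufixpart$)(?:(?!^(?:sufixpart|end)$).)*?(subdata\\S*)(?=.*?^end$)");

    static List<String> extract(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = SUBDATA.matcher(input);
        while (m.find()) {        // each find() continues from the previous match via \G
            out.add(m.group(1));  // the subdata token
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "invalidsufix\nsubadatax\nsufixpart\nsubdata1\nsomerandomn\nsubdata2\nsubdatan\nend";
        System.out.println(extract(s)); // [subdata1, subdata2, subdatan]
    }
}
```

With several blocks in the input this returns all tokens in a single list, without telling you which block each one came from.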

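Java demo (note that here the first group is capturing, so the subdata token is in Group 2, and a non-empty Group 1 marks the start of a new block):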

String rx = "(?sm)(\\G(?!\\A)|^sufixpart$)(?:(?!^(?:sufixpart|end)$).)*?(subdata\\S*)(?=.*?^end$)";
String s = "invalidsufix\nsubadatax\nsufixpart\nsubdata1\nsomerandomn\nsubdata2\nsubdatan\nend\ninvalidsufix\nsubadatax\nsufixpart\nsubdata001\nsomerandomn\nsubdata002\nsubdata00n\nend";
Pattern pattern = Pattern.compile(rx);
Matcher matcher = pattern.matcher(s);
List<List<String>> res = new ArrayList<>();
List<String> lst = null;
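while (matcher.find()){
    if (!matcher.group(1).isEmpty()) {   // Group 1 is "sufixpart": a new block starts
        if (lst != null) res.add(lst);   //    store the previous block's list, if any
        lst = new ArrayList<>();
        lst.add(matcher.group(2));
    } else lst.add(matcher.group(2));    // Group 1 is empty (\G matched): same block
}
if (lst != null) res.add(lst);           // Store the last block's list
System.out.println(res); // => [[subdata1, subdata2, subdatan], [subdata001, subdata002, subdata00n]]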

Upvotes: 1
