PNS

Reputation: 19905

Java regex to extract text sequences across multiple lines

Given an excerpt of text like

Preface (optional, up to multiple lines)
Main : sequence1
   sequence2
   sequence3
   sequence4
Epilogue (optional, up to multiple lines)

which Java regular expression could be used to extract all the sequences (i.e. sequence1, sequence2, sequence3, sequence4 above)? For example, could this be done with a Matcher.find() loop?

Each "sequence" is preceded by, and may also contain, zero or more whitespace characters (including tabs).

The following regex

(?m).*Main(?:[ |\t]+:(?:[ |\t]+(\S+)[\r\n])+

only yields the first sequence (sequence1).

Upvotes: 4

Views: 324

Answers (1)

Wiktor Stribiżew

Reputation: 626804

You may use the following regex:

(?m)(?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*)(\S+)\r?\n?

Details:

  • (?m) - multiline mode on
  • (?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*) - either of the two:
    • \G(?!\A)[^\S\r\n]+ - end of the previous successful match (\G(?!\A)) and then 1+ horizontal whitespaces ([^\S\r\n]+, can be replaced with [\p{Zs}\t]+ or [\s&&[^\r\n]]+)
    • | - or
    • ^Main\s*:\s* - start of a line, Main, 0+ whitespaces, :, 0+ whitespaces
  • (\S+) - Group 1 capturing 1+ non-whitespace symbols
  • \r?\n? - an optional CR and an optional LF.
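The three horizontal-whitespace classes mentioned above can be compared with a minimal sketch like the one below. The class name HorizontalWhitespaceDemo and the helper matches are hypothetical names chosen for illustration; only Pattern.matches comes from the standard library.

```java
import java.util.regex.Pattern;

public class HorizontalWhitespaceDemo {
    // The three character classes from the explanation, as Java string literals.
    public static final String[] CLASSES = {
        "[^\\S\\r\\n]",      // whitespace that is neither CR nor LF
        "[\\p{Zs}\\t]",      // Unicode space separators plus tab
        "[\\s&&[^\\r\\n]]"   // \s intersected with "not CR and not LF"
    };

    // Hypothetical helper: true if the single character matches the class.
    public static boolean matches(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        for (String cc : CLASSES) {
            System.out.println(cc
                + " space=" + matches(cc, ' ')
                + " tab=" + matches(cc, '\t')
                + " LF=" + matches(cc, '\n')
                + " CR=" + matches(cc, '\r'));
        }
    }
}
```

All three classes accept a space and a tab and reject CR and LF, which is what the \G branch relies on. They are not fully interchangeable, though: with default flags, \p{Zs} also covers characters such as the no-break space that \s does not, while \s covers form feed and vertical tab, which are not in \p{Zs}.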

See the Java code below:

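import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \r and \n are Java string escapes here, so the regex engine receives literal CR/LF characters
String p = "(?m)(?:\\G(?!\\A)[^\\S\r\n]+|^Main\\s*:\\s*)(\\S+)\r?\n?";
String s = "Preface (optional, up to multiple lines)...\nMain : sequence1\n   sequence2\n   sequence3\n   sequence4\nEpilogue (optional, up to multiple lines)";
Matcher m = Pattern.compile(p).matcher(s);
while (m.find()) {
    System.out.println(m.group(1)); // Group 1 holds one sequence per match
}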
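
If the sequences are needed as a list rather than printed, the same loop can be wrapped in a small helper. This is only a sketch: the class name SequenceExtractor and the method extract are hypothetical, and the sample input assumes tabs and Windows (CRLF) line endings to exercise the optional \r in the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequenceExtractor {
    // Same pattern as above, written with regex escapes (\\r, \\n) instead of
    // raw CR/LF characters; both spellings match the same text.
    private static final Pattern SEQ = Pattern.compile(
        "(?m)(?:\\G(?!\\A)[^\\S\\r\\n]+|^Main\\s*:\\s*)(\\S+)\\r?\\n?");

    // Hypothetical helper: collects every Group 1 capture into a list.
    public static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = SEQ.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input with a tab after the colon and CRLF line endings.
        String s = "Preface\r\nMain :\tsequence1\r\n \tsequence2\r\n   sequence3\r\nEpilogue";
        System.out.println(extract(s)); // prints [sequence1, sequence2, sequence3]
    }
}
```

The chain stops at the first line that does not begin with horizontal whitespace, because \G only matches where the previous match ended. An epilogue line that itself starts with spaces or tabs would therefore be picked up as a further sequence.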

Upvotes: 3
