Niranjan C

Reputation: 23

Java -- Best way to grab ALL Strings between two regex?

I need to extract content from a file that can contain multiple occurrences of a pattern. The file consists of multiple sections, and I want to extract each section. The extracted content should include the strings that match the start and end patterns.

Example file content:

01
Community based Index1- 
...some text....
...some text..
Conclusion: The significant increase of testing 
...
some text. 

02
Community based Index2- 
.some text.
.some text.
Conclusion: The significant increase of testing 
...
...<End of para> 
:
:

I am trying the following patterns, but they are not working:

String patternStart = "\\d{2}[^\\d.,)][\\s:-]?[\\r\\n][A-Z]";
String patternEnd = "Conclusion.*(\\n.*)*"; // including the entire paragraph

I am using Pattern and Matcher as shown here, but I get no match:

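 String regexString = Pattern.quote(patternStart)  + "(.*?)" + Pattern.quote(patternEnd);
 Pattern pattern = Pattern.compile(regexString);
 // fileContent holds the text read from the file (variable name assumed)
 Matcher matcher = pattern.matcher(fileContent);
 while (matcher.find()) {
     String textInBetween = matcher.group(1);
 }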

Upvotes: 2

Views: 77

Answers (1)

The fourth bird

Reputation: 163632

You could use a single pattern to extract the whole section:

^\d+(?:\R(?!\d+\R|Conclusion:).*)*\RConclusion:\h+(.*(?:\R(?!\d+\R|Conclusion:).*)*)

Explanation

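  • ^ Start of string (or start of each line when Pattern.MULTILINE is enabled, which is needed to match every section in the file)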
  • \d+ Match 1+ digits
  • (?: Non capture group
    • \R(?!\d+\R|Conclusion:).* Match a Unicode newline sequence and the rest of the line, provided the line does not start with either 1+ digits followed by a newline or Conclusion:
  • )* Close group and repeat 0+ times to match all the lines
  • \RConclusion:\h+ Match a newline and Conclusion: followed by 1+ horizontal whitespace chars
  • ( Capture group 1
    • .* Match the whole line
    • (?:\R(?!\d+\R|Conclusion:).*)* Repeat 0+ times matching all lines that do not start with either 1+ digits followed by a newline or Conclusion:
  • ) Close group 1

Regex demo

In Java

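String regex = "^\\d+(?:\\R(?!\\d+\\R|Conclusion:).*)*\\RConclusion:\\h+(.*(?:\\R(?!\\d+\\R|Conclusion:).*)*)";
// MULTILINE lets ^ match at the start of each section, not only at the start of the input
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);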

See a Java demo
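
A minimal sketch of how the pattern might be applied to the whole file content, assuming the text has already been read into a String. The class name SectionExtractor and its two methods are hypothetical and only serve this illustration; the pattern is compiled with Pattern.MULTILINE so that ^ can match at the start of every section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the question.
class SectionExtractor {

    // The pattern from this answer, with \h+ after "Conclusion:" as in the explanation.
    // MULTILINE makes ^ match at the start of each line instead of only at the start of the input.
    private static final Pattern SECTION = Pattern.compile(
            "^\\d+(?:\\R(?!\\d+\\R|Conclusion:).*)*"
            + "\\RConclusion:\\h+(.*(?:\\R(?!\\d+\\R|Conclusion:).*)*)",
            Pattern.MULTILINE);

    // Returns every whole section (the full match), from the numeric
    // header line through the end of the conclusion paragraph.
    static List<String> extractSections(String text) {
        List<String> sections = new ArrayList<>();
        Matcher matcher = SECTION.matcher(text);
        while (matcher.find()) {
            sections.add(matcher.group());
        }
        return sections;
    }

    // Returns only the conclusion text (capture group 1) of every section.
    static List<String> extractConclusions(String text) {
        List<String> conclusions = new ArrayList<>();
        Matcher matcher = SECTION.matcher(text);
        while (matcher.find()) {
            conclusions.add(matcher.group(1));
        }
        return conclusions;
    }

    public static void main(String[] args) {
        // Sample input modelled on the file content in the question.
        String text = String.join("\n",
                "01",
                "Community based Index1-",
                "...some text....",
                "Conclusion: The significant increase of testing",
                "...",
                "some text.",
                "",
                "02",
                "Community based Index2-",
                ".some text.",
                "Conclusion: The significant increase of testing",
                "...");

        for (String section : extractSections(text)) {
            System.out.println("--- section ---");
            System.out.println(section.trim());
        }
    }
}
```

Note that a blank line between two sections is matched as part of the preceding section, so trimming the result may be useful. The sketch also assumes each section header is a line containing only digits, as in the sample.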

Upvotes: 1
