Reputation: 23
I need to extract content from a file that can contain multiple occurrences of a pattern. The file consists of multiple sections, and I want to extract each section. The extracted content should include the text that matches the pattern.
E.g., file content:
01
Community based Index1-
...some text....
...some text..
Conclusion: The significant increase of testing
...
some text.
02
Community based Index2-
.some text.
.some text.
Conclusion: The significant increase of testing
...
...<End of para>
:
:
I tried the following patterns, but they do not work:
String patternStart = "\\d{2}[^\\d.,)][\\s:-]?[\\r\\n][A-Z]";
String patternEnd = "Conclusion.*(\\n.*)*"; // including the entire para
I am using Pattern and Matcher with them, but no match is found:
String regexString = Pattern.quote(patternStart) + "(.*?)" + Pattern.quote(patternEnd);
Pattern pattern = Pattern.compile(regexString);
while (matcher.find()) {
String textInBetween = matcher.group(1);
}
Upvotes: 2
Views: 77
Reputation: 163632
You could use a single pattern to extract the whole section:
^\d+(?:\R(?!\d+\R|Conclusion:).*)*\RConclusion:\h+(.*(?:\R(?!\d+\R|Conclusion:).*)*)
Explanation
^
Start of the string (or of each line when the pattern is compiled with Pattern.MULTILINE)
\d+
Match 1+ digits
(?:
Non-capturing group
\R(?!\d+\R|Conclusion:).*
Match a Unicode newline sequence and the rest of the line, provided the line does not start with either 1+ digits followed by a newline or Conclusion:
)*
Close the group and repeat it 0+ times to match all those lines
\RConclusion:\h+
Match a newline and Conclusion: followed by 1+ horizontal whitespace chars
(
Capture group 1
.*
Match the rest of the line
(?:\R(?!\d+\R|Conclusion:).*)*
Repeat 0+ times, matching all lines that do not start with either 1+ digits followed by a newline or Conclusion:
)
Close group 1
In Java, with the backslashes doubled (compile it with Pattern.MULTILINE so that ^ also matches at the start of each later section, not only at the start of the input):
String regex = "^\\d+(?:\\R(?!\\d+\\R|Conclusion:).*)*\\RConclusion:\\h+(.*(?:\\R(?!\\d+\\R|Conclusion:).*)*)";
See a Java demo
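For reference, here is a minimal, self-contained sketch of how the pattern could be applied to a whole file's content. The class name SectionExtractor and the method extractSections are hypothetical names chosen for illustration. It assumes the pattern is compiled with Pattern.MULTILINE (so that ^ matches at the start of every section) and uses the \h+ form after Conclusion: as in the pattern at the top of this answer. The sample input is a shortened version of the one in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SectionExtractor {
    // MULTILINE lets ^ match at the start of every line,
    // not only at the start of the input.
    private static final Pattern SECTION = Pattern.compile(
        "^\\d+(?:\\R(?!\\d+\\R|Conclusion:).*)*\\RConclusion:\\h+(.*(?:\\R(?!\\d+\\R|Conclusion:).*)*)",
        Pattern.MULTILINE);

    // Returns one String[] per section:
    // [0] = the whole section (group 0), [1] = the conclusion text (group 1).
    static List<String[]> extractSections(String input) {
        List<String[]> sections = new ArrayList<>();
        Matcher matcher = SECTION.matcher(input);
        while (matcher.find()) {
            sections.add(new String[] { matcher.group(), matcher.group(1) });
        }
        return sections;
    }

    public static void main(String[] args) {
        String input = String.join("\n",
            "01",
            "Community based Index1-",
            "...some text....",
            "Conclusion: The significant increase of testing",
            "some text.",
            "02",
            "Community based Index2-",
            ".some text.",
            "Conclusion: The significant increase of testing",
            "...<End of para>");

        for (String[] section : extractSections(input)) {
            System.out.println("--- section ---");
            System.out.println(section[0]);
            System.out.println("--- conclusion ---");
            System.out.println(section[1]);
        }
    }
}
```

Note that the regex string is passed to Pattern.compile directly. Wrapping it in Pattern.quote, as in the question, would turn the whole expression into a literal string, which is one likely reason no match was found there.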
Upvotes: 1