Reputation: 21
I have a simple program that scans a PDF and returns a specific section of text.
public static String returnReportPlan(File file) throws IOException {
    // try-with-resources closes the document even if text extraction throws
    try (PDDocument document = PDDocument.load(file)) {
        PDFTextStripper stripper = new PDFTextStripper();
        String page = stripper.getText(document);
        Matcher planMatcher = Pattern.compile(Constants.REPORT_PLAN_REGEX).matcher(page);
        // Group 1 is the section I want to return
        if (planMatcher.find()) return planMatcher.group(1);
        return null;
    }
}
The regex is:
Fruits\n([\w\s\W]+)\n
The PDF has weirdly placed newline characters. The capture group contains words, numbers, and special characters. Here's an example:
Fruits
Banana apple/grapple Cherry08 orange45 strawberry
Grape strawberry-lemonade cherry
Cherry08
Basically, I want group 1 to capture everything except the first line ("Fruits") and the last line ("Cherry08"). The last line is ALWAYS a copy of a mix of letters and numbers that also shows up in the area I want to capture, repeated one or more times. For instance, there might be one or three "Cherry08" in the capture group and only one on the last line, or there might be one in the capture group and two on the last line. The actual data is not fruits but sensitive information, so I don't want to post it here.
In online regex testers, it captures exactly what I want, but when I run it through my code in Java it always includes the last line in the capture group, so that I end up with everything except Fruits basically. What am I doing wrong?
EDIT: I forgot to mention that, to make the match work at all in Java, I have to add \r before the first \n. I'm not sure whether that's relevant.
Upvotes: 1
Views: 41
Reputation: 627086
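The text you read from the file ends with a newline character, so the final \n in your pattern matches that trailing newline and the last line ends up inside the capture group.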
You should use
text = text.trim();
before you run your regex on the text variable (page in your code).
Then, you can simply use
Fruits\n([\w\W]+)\n
There is no need to add \s to the character class, because [\w\W] already matches any character, whitespace included.
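To see the effect of trim(), here is a minimal sketch that runs the same pattern on an untrimmed and a trimmed string. The sample text is a hypothetical stand-in for the PDF output, and it assumes plain \n line endings; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrimBeforeMatch {
    // Returns group 1 of Fruits\n([\w\W]+)\n, or null when there is no match.
    static String extract(String text) {
        Matcher m = Pattern.compile("Fruits\\n([\\w\\W]+)\\n").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the extracted text; note the trailing newline.
        String page = "Fruits\n"
                + "Banana apple/grapple Cherry08 orange45 strawberry\n"
                + "Grape strawberry-lemonade cherry\n"
                + "Cherry08\n";

        // Untrimmed: the greedy group runs up to the trailing newline,
        // so the final "Cherry08" line is part of group 1.
        System.out.println(extract(page));
        System.out.println("---");
        // Trimmed: the last newline now sits before "Cherry08",
        // so group 1 stops at the end of the "Grape ..." line.
        System.out.println(extract(page.trim()));
    }
}
```

Because the group is greedy, it always backtracks to the last \n in the input, which is why removing the trailing newline is enough to drop the last line.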
Or, you may use . with the Pattern.DOTALL modifier, or the (?s) embedded flag option:
(?s)Fruits\n(.+)\n
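The question's edit says a \r had to be added before the first \n, which suggests the extracted text uses \r\n line endings. If that is the case, matching on \n alone would leave a stray \r at the end of group 1. One possible workaround, sketched below under that assumption, is to normalize every line break to \n with the \R linebreak matcher (Java 8 and later) before trimming and matching. The class name, method name, and sample text are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineEndingNormalizer {
    private static final Pattern PLAN = Pattern.compile("(?s)Fruits\\n(.+)\\n");

    // Normalizes line breaks, trims, and returns group 1, or null when there is no match.
    static String extract(String text) {
        // \R matches \r\n as a single unit as well as a lone \n or \r.
        String normalized = text.replaceAll("\\R", "\n").trim();
        Matcher m = PLAN.matcher(normalized);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for text extracted with Windows-style line endings.
        String page = "Fruits\r\n"
                + "Banana apple/grapple Cherry08 orange45 strawberry\r\n"
                + "Grape strawberry-lemonade cherry\r\n"
                + "Cherry08\r\n";

        // Prints the two middle lines, without "Fruits" and without the final "Cherry08".
        System.out.println(extract(page));
    }
}
```

Normalizing first keeps the pattern itself simple, and the same method also works on input that already uses \n only.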
Upvotes: 2