jermriddled

Reputation: 21

Regex captures everything I want in online tester, but captures extra characters in Java

I have a simple program that scans a PDF and returns a specific section of text.

public static String returnReportPlan(File file) throws IOException {

    // try-with-resources closes the document even if text extraction throws
    try (PDDocument document = PDDocument.load(file)) {
        PDFTextStripper stripper = new PDFTextStripper();
        String page = stripper.getText(document);
        Matcher planMatcher = Pattern.compile(Constants.REPORT_PLAN_REGEX).matcher(page);
        // group(1) is whatever the regex captures after the "Fruits" line
        return planMatcher.find() ? planMatcher.group(1) : null;
    }

}

The regex is:

Fruits\n([\w\s\W]+)\n

The PDF has oddly placed newline characters. The text I want to capture contains words, numbers, and special characters. Here's an example:

Fruits
Banana apple/grapple Cherry08 orange45 strawberry
Grape strawberry-lemonade cherry
Cherry08

Basically, I want group 1 to capture everything except the first line ("Fruits") and the last line ("Cherry08"). The last line is ALWAYS a copy of a token, a mix of letters and numbers, that also shows up in the area I want to capture. The token can appear one or more times in either place: for instance, there might be one or three "Cherry08" in the capture group and only one on the last line, or one in the capture group and two on the last line. The actual data is not fruits but sensitive information, so I don't want to post it here.

In online regex testers, it captures exactly what I want, but when I run it through my code in Java it always includes the last line in the capture group, so I end up with everything except "Fruits". What am I doing wrong?

EDIT: forgot to mention that in order to make the match work at all in Java, I have to add \r before the first \n, not sure if that's relevant.

Upvotes: 1

Views: 41

Answers (1)

Wiktor Stribiżew

Reputation: 627086

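The text you read from the file ends with a newline character. The capture group is greedy, so it runs up to the last newline in the text, which is the one after the last line, and the last line ends up inside the group.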

You should trim the text before you run your regex on it. With the page variable from your code:

page = page.trim();
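To see the effect of the trailing newline, here is a minimal sketch that runs the pattern from the question on a shortened version of the sample, with and without trim(). The class name TrailingNewlineDemo and the helper capture are made up for illustration, and the input string is an assumption about what the extracted text looks like.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingNewlineDemo {
    // The pattern from the question, escaped as a Java string literal.
    static final Pattern PLAN = Pattern.compile("Fruits\\n([\\w\\s\\W]+)\\n");

    // Hypothetical helper: returns group 1 of the first match, or null.
    static String capture(String text) {
        Matcher m = PLAN.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted text: it ends with a newline.
        String raw = "Fruits\nBanana apple\nGrape cherry\nCherry08\n";

        // Untrimmed: the greedy group stops at the final newline,
        // so the last line "Cherry08" is still inside the group.
        System.out.println("[" + capture(raw) + "]");

        // Trimmed: the last newline is now the one before "Cherry08",
        // so the group stops before the last line.
        System.out.println("[" + capture(raw.trim()) + "]");
    }
}
```

Online testers usually get pasted text without the trailing newline, which would explain why the same pattern behaves differently there.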

Then, you can simply use

Fruits\n([\w\W]+)\n

There is no need to add \s to the character class: [\w\W] already matches any character, whitespace included.

Or, you may use . with the Pattern.DOTALL modifier, or (?s) embedded flag option:

(?s)Fruits\n(.+)\n
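Putting it together, here is a rough sketch of the (?s) variant. The EDIT in the question suggests the extracted text uses \r\n line endings, so this version also normalizes them to \n before matching; that step is my assumption, not something I can confirm from the PDF. The class name ReportPlanDemo and the method extractPlan are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReportPlanDemo {
    // (?s) turns on DOTALL, so '.' also matches line terminators.
    static final Pattern PLAN = Pattern.compile("(?s)Fruits\\n(.+)\\n");

    // Hypothetical helper: returns the text between the "Fruits" line
    // and the last line, or null if the pattern does not match.
    static String extractPlan(String page) {
        // Assumption: line endings may be \r\n (or a lone \r).
        // Normalize them to \n, then drop the trailing newline.
        String text = page.replaceAll("\\r\\n?", "\n").trim();
        Matcher m = PLAN.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "Fruits\r\n"
                + "Banana apple/grapple Cherry08 orange45 strawberry\r\n"
                + "Grape strawberry-lemonade cherry\r\n"
                + "Cherry08\r\n";
        System.out.println(extractPlan(page));
    }
}
```

Without the normalization, a pattern ending in \n would leave a stray \r at the end of the group on \r\n input, so it seemed worth handling in one place.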

Upvotes: 2
