Reputation: 133

PDFbox loading large files

I'm trying to convert the first page of a PDF file to an image using PDFBox. When I load a large PDF file, I get an exception.

Code:

    PDDocument doc;
    try {
        InputStream input  = new URL("http://www.jewishfederations.org/local_includes/downloads/39497.pdf").openStream();
        doc = PDDocument.load(input);
        PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
        BufferedImage image = firstPage.convertToImage();
        File outputfile = new File("image2.png");
        ImageIO.write(image, "png", outputfile);
        input.close();
        doc.close();

    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

Exception:

org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 72435 is wrong. Fall back to reading stream until 'endstream'.
org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72435 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:554)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
    at Worker.main(Worker.java:27)
Caused by: java.io.IOException: Push back buffer is full
    at java.io.PushbackInputStream.unread(Unknown Source)
    at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
    at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:550)
    ... 5 more

Upvotes: 7

Answers (4)

Yaniv Levy

Reputation: 78

In the 2.0.* versions, open the PDF like this:

    PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());

This sets up buffering to use only temporary files (no main memory), with no restriction on size.

Good Luck

Upvotes: 1

James Oravec

Reputation: 20381

I had a similar issue. Based on the error, I assumed it was caused by a large PDF file, but it turned out to be a corrupt PDF file.

For our use case, we had a PDF template file (whose form values we populate programmatically) as a resource in our project, packaged into our WAR.

The exception I was seeing, for reference: org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 480478 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize. We added the property, ran things again, and got a different error.

The next stack trace stated "Could not read embedded TTF for font TimesNewRoman,Bold". It took us a while, but after exploding the WAR and trying to open the PDF file inside it, we noticed that it was corrupt, whereas the PDF file in source was intact and opened without issues.

The root cause of our issue was that we added "filtering" in our pom for our resource folder. We did this so that we could use some reflection to get some values in our health check page, but that corrupted the pdf file, which we figured out from the following reference: https://bitbucket.org/petermr/xhtml2stm/issues/12/pdf-files-are-being-corrupted-at-some

Below is an example of the filtering we set up that bit us:

    <resources>
        <resource>
            <directory>src/main/resources</directory>
            <filtering>true</filtering>
        </resource>
    </resources>

Our solution was to remove this from our pom and rework how we got the information for our health page.
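
To illustrate why a text-oriented copy can damage a PDF, here is a minimal sketch. It is a simplified model, not Maven's actual filtering code: it assumes the file is decoded as UTF-8 text and then re-encoded, and the class and method names are made up for the example. Plain ASCII content survives the round trip, but bytes that are not valid UTF-8 are replaced during decoding and cannot be recovered.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; a simplified model of a text-based copy, not Maven's implementation.
final class FilteringCorruptionDemo {

    /** Decodes the bytes as UTF-8 text, then encodes that text back to bytes. */
    static byte[] copyAsText(byte[] original) {
        String text = new String(original, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Pure ASCII, like the header line of a PDF.
        byte[] ascii = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // High-bit bytes that do not form valid UTF-8, as commonly found in binary files.
        byte[] binary = {(byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};

        System.out.println("ASCII survives:  " + Arrays.equals(ascii, copyAsText(ascii)));
        System.out.println("binary survives: " + Arrays.equals(binary, copyAsText(binary)));
    }
}
```

If filtering has to stay enabled for text resources, the Maven Resources Plugin documents a nonFilteredFileExtensions setting for keeping binary file types out of filtering, which may be worth a look before removing filtering entirely.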

Upvotes: 1

Tilman Hausherr

Reputation: 18861

An alternative solution for the 1.8.* PDFBox versions is to use the non-sequential parser. In that case, the code would not be

    doc = PDDocument.load(input);

but

    doc = PDDocument.loadNonSeq(input, null);

That parser (which will be the only one in the upcoming 2.0 version) does not depend on the size of a pushback buffer.

Upvotes: 4

guyfleeman

Reputation: 473

First, find the current buffer size (this prints null if the property has never been set):

    System.out.println(System.getProperty("org.apache.pdfbox.baseParser.pushBackSize"));

Now that you have a baseline, do exactly what the exception message suggests. Increase the buffer size above what you just printed, like this:

    System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "<buffer size>");

Keep increasing the buffer size until it works. Hopefully you won't run out of memory; if you do, increase the heap size.

This is how you set system properties at runtime. You could also pass it as a JVM argument (-Dorg.apache.pdfbox.baseParser.pushBackSize=<buffer size>), but I find that setting it near the beginning of main does the trick and makes the project easier for future developers to maintain.
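
As a rough sketch of the runtime approach, the helper below reads the property with a fallback and raises it to at least the byte count reported in the exception. The class and method names are hypothetical, the 1024 bytes of headroom is an arbitrary choice, and the sketch assumes PDFBox reads the property when the parser is created, so it has to run before PDDocument.load is called.

```java
// Hypothetical helper; only the property name comes from the exception message.
final class PushBackSizeConfig {

    static final String PROP = "org.apache.pdfbox.baseParser.pushBackSize";

    /** Returns the configured size, or the fallback if the property is unset or not a number. */
    static int currentSize(int fallback) {
        return Integer.getInteger(PROP, fallback);
    }

    /** Raises the property to at least requiredBytes; an existing larger value is kept. */
    static int ensureAtLeast(int requiredBytes) {
        int size = Math.max(currentSize(0), requiredBytes);
        System.setProperty(PROP, Integer.toString(size));
        return size;
    }

    public static void main(String[] args) {
        // 72435 is the byte count from the question's stack trace;
        // the extra 1024 bytes are arbitrary headroom for illustration.
        int size = ensureAtLeast(72435 + 1024);
        System.out.println(PROP + " = " + size);
        // Load the document only after this point.
    }
}
```

Taking the required size from the exception message avoids the trial-and-error loop, although a different stream in the same file may still ask for more.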

For whatever reason, with large files you don't have a big enough buffer to load the page. Maybe the page is loaded into a buffer before or while it's rendered into an image. My guess is that the DPI in the PDF is very high and can't fit in the buffer.
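
For what it's worth, the "Caused by" part of the question's stack trace comes from java.io.PushbackInputStream, and that behavior can be reproduced without PDFBox. The sketch below reads a number of bytes and then tries to push all of them back, which fails once the count exceeds the stream's pushback capacity. The class name is made up, and 65536 is only an assumed buffer size for illustration; 72435 is the byte count from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical demo class; it uses only the JDK and does not touch PDFBox.
final class PushbackDemo {

    /**
     * Reads count bytes, then tries to unread all of them.
     * Returns null on success, or the exception message on failure.
     */
    static String readThenUnread(int bufferSize, int count) {
        byte[] data = new byte[count];
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data), bufferSize)) {
            byte[] chunk = new byte[count];
            int n = in.read(chunk, 0, count);
            // Pushing back more bytes than the buffer can hold throws an IOException.
            in.unread(chunk, 0, n);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumed buffer size smaller than the number of bytes to push back: fails.
        System.out.println("small buffer: " + readThenUnread(65536, 72435));
        // Buffer large enough for all the bytes: succeeds and prints null.
        System.out.println("large buffer: " + readThenUnread(131072, 72435));
    }
}
```

Seen this way, the limit is about how many bytes the parser wants to rewind after the "Specified stream length ... is wrong" warning, which is why raising the property helps.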

Upvotes: 2
