Reputation: 133
I'm trying to convert the first page of a PDF file to an image using PDFBox. When I load a large PDF file, I get an exception.
code:
PDDocument doc;
try {
    InputStream input = new URL("http://www.jewishfederations.org/local_includes/downloads/39497.pdf").openStream();
    doc = PDDocument.load(input);
    PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    BufferedImage image = firstPage.convertToImage();
    File outputfile = new File("image2.png");
    ImageIO.write(image, "png", outputfile);
    input.close();
    doc.close();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
exception:
org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 72435 is wrong. Fall back to reading stream until 'endstream'.
org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 72435 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:554)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1219)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1186)
at Worker.main(Worker.java:27)
Caused by: java.io.IOException: Push back buffer is full
at java.io.PushbackInputStream.unread(Unknown Source)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:144)
at org.apache.pdfbox.io.PushBackInputStream.unread(PushBackInputStream.java:133)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:550)
... 5 more
Upvotes: 7
Views: 9411
Reputation: 78
In the 2.0.* versions, open the PDF like this:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This sets up buffering so that only temporary files are used (no main memory), with no size restriction.
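The call above takes a `file`, while the question starts from a URL stream. A minimal sketch of one way to bridge the two, assuming it is acceptable to spool the download to a temporary file first. The class and method names (`TempFileSpool`, `spoolToTempFile`) are hypothetical and not part of PDFBox; only JDK classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox.
final class TempFileSpool {
    /** Copies the whole stream into a new temporary file and returns its path. */
    static Path spoolToTempFile(InputStream in) {
        try {
            Path tmp = Files.createTempFile("download-", ".pdf");
            // createTempFile already created the file, so it has to be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in bytes; in the question this would be new URL(...).openStream().
        InputStream in = new ByteArrayInputStream(new byte[] {'%', 'P', 'D', 'F'});
        Path tmp = spoolToTempFile(in);
        System.out.println(tmp + " (" + tmp.toFile().length() + " bytes)");
        tmp.toFile().delete();
    }
}
```

The resulting `tmp.toFile()` could then be passed as the `file` argument of the load call shown above.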
Good Luck
Upvotes: 1
Reputation: 20381
I had a similar issue. Based on the error, I assumed it was caused by a large PDF file, but it turned out to be a corrupt PDF file.
In our use case, we had a PDF template file (whose form values we populate programmatically) as a resource in our project, which gets packaged into our WAR.
For reference, the exception I was seeing was: org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 480478 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
We added the property, ran things again, and got a different issue.
The next stack trace stated "Could not read embedded TTF for font TimesNewRoman,Bold". It took us a while, but after exploding the WAR and trying to open the PDF file inside it, we noticed that it was corrupt, whereas the PDF file in source was fine and could be opened without issues.
The root cause was that we had added "filtering" in our pom for our resource folder. We did this so that we could use some reflection to get some values in our health check page, but it corrupted the PDF file, which we figured out from the following reference: https://bitbucket.org/petermr/xhtml2stm/issues/12/pdf-files-are-being-corrupted-at-some
Below is an example of the filtering we set up that bit us:
<resources>
  <resource>
    <directory>src/main/resources</directory>
    <filtering>true</filtering>
  </resource>
</resources>
Our solution was to remove this from our pom and rework how we got the information for our health page.
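To illustrate why this kind of corruption can happen, here is a minimal sketch. It assumes the filtering step treats every file as text, decoding it in some charset and writing it back (UTF-8 is used here purely as an example). The class and method names are hypothetical, and this is a simulation of the effect, not Maven's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; simulates a text-oriented copy of a binary file.
final class FilteringCorruptionDemo {
    /** Decodes the bytes as UTF-8 text and encodes them again, as a text filter might. */
    static byte[] textRoundTrip(byte[] original) {
        String asText = new String(original, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    /** True when the bytes come out of the round trip unchanged. */
    static boolean survives(byte[] original) {
        return Arrays.equals(original, textRoundTrip(original));
    }

    public static void main(String[] args) {
        // Plain ASCII, like the first line of a PDF, is unaffected.
        byte[] header = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        // Arbitrary high bytes, as found in embedded fonts and compressed streams,
        // are not valid UTF-8 and get replaced during decoding.
        byte[] binary = {(byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3};

        System.out.println(survives(header)); // true
        System.out.println(survives(binary)); // false
    }
}
```

Once bytes inside a stream or an embedded font have been replaced, stream lengths and font data no longer match, which would be consistent with both errors described above.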
Upvotes: 1
Reputation: 18861
An alternative solution for the 1.8.* PDFBox versions is to use the non-sequential parser. In that case, the code would not be
doc = PDDocument.load(input);
but
doc = PDDocument.loadNonSeq(input, null);
That parser (which will be the only one in the upcoming 2.0 version) does not depend on the size of a pushback buffer.
Upvotes: 4
Reputation: 473
First, check whether a buffer size has already been set (this prints null if the property is unset):
System.out.println(System.getProperty("org.apache.pdfbox.baseParser.pushBackSize"));
Now that you have a baseline, do exactly what it suggests. Increase the buffer size above what you just printed out using this:
System.setProperty("org.apache.pdfbox.baseParser.pushBackSize", "<buffer size>");
Keep increasing the buffer size until it works. Hopefully you won't run out of memory; if you do, increase the heap.
This is how you set system properties at runtime. You could also pass the property as a JVM argument, but I find that setting it near the beginning of main does the trick and makes the project easier for future developers to maintain.
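A minimal sketch of setting the property early in main. The property name comes from the exception message; the class and method names are hypothetical, and the doubling of the reported stream length is an arbitrary margin, not a recommended value.

```java
// Hypothetical helper; only the property name comes from the PDFBox error message.
final class PushBackSizeConfig {
    static final String PROP = "org.apache.pdfbox.baseParser.pushBackSize";

    /** Returns the configured size, or the fallback if the property is unset or not a number. */
    static int configuredSize(int fallback) {
        return Integer.getInteger(PROP, fallback);
    }

    /** Raises the property to at least minBytes, keeping any larger existing value. */
    static int ensureAtLeast(int minBytes) {
        int target = Math.max(configuredSize(0), minBytes);
        System.setProperty(PROP, Integer.toString(target));
        return target;
    }

    public static void main(String[] args) {
        // Prints null unless the property was already set, e.g. with -D on the command line.
        System.out.println(System.getProperty(PROP));
        // 72435 is the stream length reported in the question's exception.
        System.out.println(ensureAtLeast(2 * 72435));
        // ... load the PDF after this point ...
    }
}
```

The command-line equivalent is the standard JVM flag form, `-Dorg.apache.pdfbox.baseParser.pushBackSize=<buffer size>`.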
For whatever reason, with large files the buffer is not big enough to load the page. The page may be loaded into a buffer before or while it is rendered into an image; my guess is that the DPI in the PDF is very high, so the data does not fit in the buffer.
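The trace in the question hints at one concrete way the buffer can overflow: the parser warns that the stream length 72435 is wrong and then fails to push those bytes back. A minimal sketch of that failure mode using the JDK's own `java.io.PushbackInputStream`, which appears in the "Caused by" part of the trace. The class and method names are hypothetical, and 65536 is just an illustrative buffer size smaller than the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

// Hypothetical demo class; uses only the JDK, not PDFBox.
final class PushBackDemo {
    /** Reads streamLength bytes, then tries to push all of them back. */
    static boolean canPushBack(int streamLength, int bufferSize) {
        byte[] data = new byte[streamLength];
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data), bufferSize)) {
            byte[] chunk = new byte[streamLength];
            int read = in.read(chunk, 0, streamLength);
            // unread throws IOException when more bytes are pushed back than the buffer holds.
            in.unread(chunk, 0, read);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canPushBack(72435, 65536)); // false
        System.out.println(canPushBack(72435, 72435)); // true
    }
}
```

Under that reading, the buffer needs to be at least as large as the number of bytes named in the exception message.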
Upvotes: 2