PDFBox 2.0.3/Java 7 - OOM Error when importing page from one PDF to another

Question

I have some code that reviews every page in a large PDF (20,000+ pages) and if that page contains a certain String, then it imports that page to another PDF.

Due to the number of occurrences, the PDF that it's being imported into grows almost as large as the source PDF - When it gets too large, it bombs out with the below exception:

Exception in thread "main" java.lang.OutofMemoryError: Java heap space
at java.utils.Arrays.copyOf (Unknown Source)
at java.io.ByteArrayOutputStream.toByteArray (Unknown Source)
at org.apache.pdfbox.cos.COSOutputStream.close(COSOutputStream.java:87)
at java.io.FilterOutputStream.close(Unknown Source)
at org.apache.pdfbox.cos.COSStream$1.close(COSStream.java:223)
at org.apache.pdfbox.pdmodel.common.PDStream.(PDStream.java:138)
at org.apache.pdfbox.pdmodel.common.PDStream.(PDStream.java:104)
at org.apache.pdfbox.pdfmodel.PDDocument.importPage(PDDocument.java:562)
at ExtractPage.extractString(ExtractPage.java:57)
at RunApp.run(RunApp.java:15)

I have researched the issue and it looks like the use of a temp file for streaming could resolve my problem. However, i just can't work out how to implement it into my code.

I do have a work around where i would batch the pages into seperate files and then merge them afterwards, using the soultion mentioned here - However, it certainley would be much more effcient and cleaner to avoid this.

Please see a summary of my code below:

File sourceFile = new File (C:\Temp\extractFROM.pdf);
PDDocument sourceDocument = PDDocument.load(SourceFile, MemoryUsageSetting.setupTempFileOnly();
PDPageTree sourcePageTree = sourceDocument.getDocumentCatalog().getPages(); 
PDDocument tempDocument = new PDDocument (MemoryUsageSetting.setupTempFileOnly())

for (PDPage page : sourcePageTree) {
// Code to extract page text and confirm if contains String
if (above psuedo code is true) {
tempDocument.importPage(page);
}
}

tempDocument.save(sourceFile);

Once it's exported around 7000 or so pages, that's when it bombs out at the tempDocument.importPage(page) line. It works perfectly for PDFs below that number.

Can anyone assist?

mkl · Accepted Answer

A program running into an OutofMemoryError might have a memory leak, or it might simply require more memory to run properly.

Thus, one change to try in such a situation is to simply increase the memory assigned to the program. If the program then runs without an issue, you can consider this a fix. As long as the memory assigned does not become completely unreasonable, that is...

This appears to be the case here, as the op confirmed

I have increased the heap as a run configuration to 670mb (The maximum i can secure with my client equipment) and this has successfully resolved the issue - In fact, i tried it on a PDF twice the size as the original failing PDF, and it easily managed this as well.

PDFBox 2.0.3/Java 7 - OOM Error when importing page from one PDF to another

Answers (1)

Related Questions