Reputation: 726
We occasionally encounter extremely large PDFs filled with full-page, high-resolution images (the result of document scanning). For example, I have a 1.7 GB PDF with 3500+ images. Loading the document takes about 50 s, but counting the images takes about 15 minutes.
I'm sure this is because the image bytes are read as part of the API calls. Is there a way to extract the image count without actually reading the image bytes?
PDFBox version: 2.0.2
Example Code:
@Test
public void imageCountIsCorrect() throws Exception {
    PDDocument pdf = readPdf();
    try {
        assertEquals(3558, countImages(pdf));
        // assertEquals(3558, countImagesWithExtractor(pdf));
    } finally {
        if (pdf != null) {
            pdf.close();
        }
    }
}

protected PDDocument readPdf() throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();
    FileInputStream stream = new FileInputStream("large.pdf");
    PDDocument pdf;
    try {
        pdf = PDDocument.load(stream, MemoryUsageSetting.setupMixed(1024 * 1024 * 250));
    } finally {
        stream.close();
    }
    stopWatch.stop();
    log.info("PDF loaded: time={}s", stopWatch.getTime() / 1000);
    return pdf;
}

protected int countImages(PDDocument pdf) throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();
    int imageCount = 0;
    for (PDPage pdPage : pdf.getPages()) {
        PDResources pdResources = pdPage.getResources();
        if (pdResources == null) {
            // A page may have no resource dictionary at all.
            continue;
        }
        for (COSName cosName : pdResources.getXObjectNames()) {
            // getXObject() instantiates the XObject; this is the suspected slow step.
            PDXObject xobject = pdResources.getXObject(cosName);
            if (xobject instanceof PDImageXObject) {
                imageCount++;
                if (imageCount % 100 == 0) {
                    log.info("Found image: #{}", imageCount);
                }
            }
        }
    }
    stopWatch.stop();
    log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
    return imageCount;
}
If I change the countImages method to rely on the COSName, the count completes in less than 1 s, but I'm a little uncertain about relying on the prefix of the name. The prefix appears to be a byproduct of the PDF encoder and not of PDFBox (I couldn't find any reference to it in the PDFBox code):
if (cosName.getName().startsWith("QuickPDFIm")) {
    imageCount++;
}
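For context on why the prefix is fragile: a resource name such as QuickPDFIm1 is an arbitrary label chosen by whatever tool wrote the file, whereas what marks an XObject as an image is the /Subtype /Image entry in its stream dictionary. The sketch below is a toy model only, using plain maps in place of PDF dictionaries rather than PDFBox classes; the class SubtypeImageCounter and both method names are hypothetical. It contrasts a subtype check with the prefix heuristic.

```java
import java.util.Map;

/** Toy model only: plain maps stand in for PDF dictionaries; this is not PDFBox API. */
final class SubtypeImageCounter {
    private SubtypeImageCounter() {
    }

    /**
     * Counts entries whose (toy) stream dictionary declares Subtype = Image.
     * Keys are resource names; values are the dictionary entries of each XObject.
     */
    static int countBySubtype(Map<String, Map<String, String>> xObjects) {
        int count = 0;
        for (Map<String, String> dictionary : xObjects.values()) {
            if ("Image".equals(dictionary.get("Subtype"))) {
                count++;
            }
        }
        return count;
    }

    /** The name-prefix heuristic, kept here only for comparison. */
    static int countByPrefix(Map<String, Map<String, String>> xObjects, String prefix) {
        int count = 0;
        for (String name : xObjects.keySet()) {
            if (name.startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }
}
```

With entries named QuickPDFIm1 (image), Im7 (image), Im8 (image) and QuickPDFImForm (form), the subtype check finds three images while the prefix check reports two, one of which is not an image at all.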
Upvotes: 0
Views: 1660
Reputation: 726
So the previous approach had some additional flaws (could miss inline images, etc.). Thanks mkl and Tilman Hausherr for the feedback!
TIL - PDF content streams contain useful operators!
My new approach extends PDFStreamEngine and increments an imageCount for every 'Do' (draw object) operator in the page content stream whose operand names an image XObject. Counting the images takes only a few hundred milliseconds with this method:
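import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;

public class PdfImageCounter extends PDFStreamEngine {
    protected int documentImageCount = 0;

    public int getDocumentImageCount() {
        return documentImageCount;
    }

    public PdfImageCounter() {
        // Only 'Do' is registered, so inline images (BI) and images nested
        // inside form XObjects are not counted by this class as written.
        addOperator(new OperatorProcessor() {
            @Override
            public void process(Operator operator, List<COSBase> arguments) throws IOException {
                if (arguments.size() < 1) {
                    throw new MissingOperandException(operator, arguments);
                }
                if (isImage(arguments.get(0))) {
                    documentImageCount++;
                }
            }

            // Looks the operand up in the current resources without creating a PDImageXObject.
            protected boolean isImage(COSBase base) {
                return (base instanceof COSName) &&
                        context.getResources().isImageXObject((COSName) base);
            }

            @Override
            public String getName() {
                return "Do";
            }
        });
    }
}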
Invoke it for each page:
protected int countImagesWithProcessor(PDDocument pdf) throws IOException {
    StopWatch stopWatch = new StopWatch();
    stopWatch.start();
    PdfImageCounter counter = new PdfImageCounter();
    for (PDPage pdPage : pdf.getPages()) {
        counter.processPage(pdPage);
    }
    stopWatch.stop();
    int imageCount = counter.getDocumentImageCount();
    log.info("Images counted: time={}s,imageCount={}", stopWatch.getTime() / 1000, imageCount);
    return imageCount;
}
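To illustrate the idea outside PDFBox: in a content stream, operands come first and the operator last, so `/Im1 Do` paints the XObject named Im1. The following is a deliberately simplified sketch using only the Java standard library. It splits on whitespace, so it ignores string literals, comments and inline image data that a real parser has to handle; the class name DoOperatorCounter and its input format are assumptions made for illustration, not part of any library.

```java
import java.util.Set;

/** Simplified illustration of scanning a content stream for 'Do'; not a real PDF parser. */
final class DoOperatorCounter {
    private DoOperatorCounter() {
    }

    /**
     * Counts 'Do' operators whose single operand names one of the given image XObjects.
     *
     * @param contentStream whitespace-separated operands and operators, e.g. "q /Im1 Do Q"
     * @param imageNames    resource names (without the leading slash) known to be images
     */
    static int countImageDraws(String contentStream, Set<String> imageNames) {
        String[] tokens = contentStream.trim().split("\\s+");
        int count = 0;
        for (int i = 1; i < tokens.length; i++) {
            // 'Do' takes exactly one operand: the name token immediately before it.
            if ("Do".equals(tokens[i]) && tokens[i - 1].startsWith("/")) {
                String name = tokens[i - 1].substring(1);
                if (imageNames.contains(name)) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Note that this kind of count tallies paint operations rather than distinct images: for the stream `q 612 0 0 792 0 0 cm /Im1 Do Q q /Fm1 Do Q /Im1 Do` with Im1 as the only image name, the result is 2, because Im1 is drawn twice and Fm1 is not an image.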
Upvotes: 0