Alexander Stepchkov

Reputation: 755

PDFBox: PDF size increases after converting to grayscale

I need to convert a scanned PDF to a grayscale PDF. I found two ways to do that.

The first is to render each page with renderImageWithDPI:

private void convertToGray() throws IOException {
    File pdfFile = new File(PATH);
    try (PDDocument originalPdf = PDDocument.load(pdfFile);
         PDDocument doc = new PDDocument()) {
        LOGGER.info("Current heap after loading file: {}", Runtime.getRuntime().totalMemory());
        PDFRenderer pdfRenderer = new PDFRenderer(originalPdf);
        for (int pageNum = 0; pageNum < originalPdf.getNumberOfPages(); pageNum++) {
            // Render the page to an 8-bit grayscale bitmap at a fixed 300 DPI.
            BufferedImage grayImage = pdfRenderer.renderImageWithDPI(pageNum, 300F, ImageType.GRAY);
            // Alternative: LosslessFactory.createFromImage(doc, grayImage);
            PDImageXObject pdImage = JPEGFactory.createFromImage(doc, grayImage);
            // Create a new page with the same media box size as the original page.
            PDRectangle mediaBox = originalPdf.getPage(pageNum).getMediaBox();
            float pageWidth = mediaBox.getWidth();
            float pageHeight = mediaBox.getHeight();
            PDPage page = new PDPage(new PDRectangle(pageWidth, pageHeight));
            doc.addPage(page);
            try (PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
                contentStream.drawImage(pdImage, 0F, 0F, pageWidth, pageHeight);
            }
        }
        doc.save(NEW_PATH);
    }
}

But this increases the file size, because some PDFs were scanned at less than 300 DPI.
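To get a feel for how much raster data a fixed DPI produces, here is a rough sketch of the arithmetic. It assumes 72 points per inch and truncation to whole pixels, which I believe is close to what the renderer does. The class name, method names, and the 595 x 842 pt page size are made up for illustration.

```java
class RenderSizeEstimate {
    // PDF user space has 72 units (points) per inch.
    static final double POINTS_PER_INCH = 72.0;

    /** Pixel count along one side of a page rendered at the given DPI. */
    public static int pixels(double points, double dpi) {
        return (int) Math.max(Math.floor(points * dpi / POINTS_PER_INCH), 1);
    }

    /** Uncompressed size in bytes of an 8-bit, single-channel bitmap of the page. */
    public static long rawGrayBytes(double widthPt, double heightPt, double dpi) {
        return (long) pixels(widthPt, dpi) * pixels(heightPt, dpi);
    }

    public static void main(String[] args) {
        double widthPt = 595, heightPt = 842; // hypothetical page size, roughly A4
        for (double dpi : new double[] {150, 200, 300}) {
            System.out.printf("%.0f DPI: %d x %d px, %d bytes uncompressed%n",
                    dpi, pixels(widthPt, dpi), pixels(heightPt, dpi),
                    rawGrayBytes(widthPt, heightPt, dpi));
        }
    }
}
```

For that page size, 300 DPI works out to 2479 x 3508 pixels, or about 8.7 MB of uncompressed gray data. If the original scan was stored at a lower resolution, rendering at 300 DPI creates more pixels than the source ever had.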

The second is to replace each existing image with a grayscale equivalent:

private void convertByImageToGray() throws IOException {
    File pdfFile = new File(PATH);
    try (PDDocument document = PDDocument.load(pdfFile)) {
        List<COSObject> objects = document.getDocument().getObjectsByType(COSName.IMAGE);
        for (COSObject object : objects) {
            LOGGER.info("Class: {}; {}", object.getClass(), object.toString());
        }
        for (int pageNum = 0; pageNum < document.getNumberOfPages(); pageNum++) {
            PDPage page = document.getPage(pageNum);
            replaceImage(document, page);
        }
        document.save(NEW_PATH);
    }
}

private void replaceImage(PDDocument document, PDPage page) throws IOException {
    PDResources resources = page.getResources();
    Iterable<COSName> xObjectNames = resources.getXObjectNames();
    if (xObjectNames != null) {
        for (COSName xObjectName : xObjectNames) {
            PDXObject object = resources.getXObject(xObjectName);
            if (object instanceof PDImageXObject) {
                PDImageXObject img1 = (PDImageXObject) object;
                BufferedImage bufferedImage1 = img1.getImage();
                BufferedImage grayBufferedImage = convertBufferedImageToGray(bufferedImage1);
//                    PDImageXObject grayImage = JPEGFactory.createFromImage(document, grayBufferedImage);
                PDImageXObject grayImage = LosslessFactory.createFromImage(document, grayBufferedImage);
                resources.put(xObjectName, grayImage);
            }
        }
    }
}

private static BufferedImage convertBufferedImageToGray(BufferedImage sourceImg) {
    ColorSpace cs = ColorSpace.getInstance(ColorSpace.CS_GRAY);
    ColorConvertOp op = new ColorConvertOp(sourceImg.getColorModel().getColorSpace(), cs, null);
    op.filter(sourceImg, sourceImg);
    return sourceImg;
}

But some files still grow to about 3 times their original size, even when they were already grayscale. Interestingly, in this case JPEGFactory produces larger files than LosslessFactory. All images in the grayscale PDF have the same size as the original ones, and I don't understand why the file grows.
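One possible contributor (an assumption, not checked against the files in question): convertBufferedImageToGray filters the image into itself, so the BufferedImage keeps whatever type it had. If that type is RGB, the pixels look gray but the image still carries three channels when it is handed to the factory. A minimal sketch of copying into a real single-channel image instead, using only the JDK (class and method names are hypothetical):

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

class GrayCopy {
    /** Returns a new 8-bit single-channel copy; the source image is left untouched. */
    public static BufferedImage toByteGray(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Java 2D converts the colors to the target's gray color space while drawing.
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    public static void main(String[] args) {
        // Stand-in for the image returned by img1.getImage() in the question.
        BufferedImage rgb = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        BufferedImage gray = toByteGray(rgb);
        System.out.println("bands before: " + rgb.getRaster().getNumBands()
                + ", after: " + gray.getRaster().getNumBands());
    }
}
```

This does not address the other likely cause, namely that the decoded bitmap is re-encoded with a different compression than the scanner originally used.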

Is there a better way to produce a grayscale PDF with a predictable size (other than Ghostscript)?

UPDATE: I've just realized that the issue is in creating a PDF from an image: the image is not compressed as well as it was in the original file.

For example, I have a dummy 1-page scan that is less than 1 MB. But if I extract the image from it (by copying it from Acrobat Reader into Paint, or via the code above), its size is ~8-10 MB depending on the method. And if I create a new PDF from this image, it's barely compressed. Here is example code:

File pdfFile = new File(FULL_FILE);
try (PDDocument document = PDDocument.load(pdfFile)) {
    PDPage page = new PDPage();
    document.addPage(page);
    PDImageXObject pdImage = PDImageXObject.createFromFile("example.png", document);
    try (PDPageContentStream contents = new PDPageContentStream(document, page)) {
        contents.drawImage(pdImage, 0F, 0F);
    }
    document.save(FULL_FILE_NEW);
}
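To make the size more predictable, it may help to measure how many bytes a page image takes at a given JPEG quality before embedding it. The sketch below uses only javax.imageio. The class and method names are hypothetical. As far as I know JPEGFactory also has a createFromImage overload that takes a quality value, but whether its output matches these byte counts exactly is an assumption.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

class JpegSizeProbe {
    /** Encodes the image as JPEG in memory with the given quality (0.0 to 1.0). */
    public static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for a rendered page: a small gray gradient.
        BufferedImage page = new BufferedImage(256, 256, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < 256; y++) {
            for (int x = 0; x < 256; x++) {
                page.getRaster().setSample(x, y, 0, (x + y) / 2);
            }
        }
        for (float quality : new float[] {0.3f, 0.75f, 0.95f}) {
            System.out.println("quality " + quality + ": "
                    + encodeJpeg(page, quality).length + " bytes");
        }
    }
}
```

Running this on a real page image for a few quality values gives a rough idea of the trade-off before any PDF is written.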

Upvotes: 1

Views: 1214

Answers (1)

Vinit Pillai

Reputation: 516

Yes, LosslessFactory produces smaller files than JPEGFactory.

The link below describes several methods for achieving the same goal. Overall, the best-quality grayscale image came from Option 6, although it was by no means the fastest (I used Option 4 myself). Comparisons are also provided to help you choose.

This link contains possible ways to convert color images to black. It helped me a lot. Let me know if it works for you, and accept my answer if it helped.

Upvotes: 2
