andlu

Reputation: 13

How to make text invisible in an existing PDF

I want to make all the text in an existing PDF transparent.

Option 1: select all the text, find a color property, and change it to "colorless".

Or, if there is no such property:

Option 2: parse the page content stream and all Form XObjects for that page, detect the text blocks (BT/ET), and set the text rendering mode to invisible.

This seems to be a complex operation.
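
To make Option 2 concrete, here is a minimal sketch of the rewriting step only. It assumes the content stream has already been decoded to plain text and that reading and writing the stream bytes is handled elsewhere. The class and method names are hypothetical, and the regex matching is deliberately naive: it does not skip string literals or inline image data that happen to contain "BT" or "Tr", so a real implementation would need a proper content stream tokenizer. It forces every existing Tr operator to mode 3 (invisible) and adds a "3 Tr" right after each BT.

```java
import java.util.regex.Pattern;

class InvisibleTextSketch {
    // An existing "<n> Tr" operator, e.g. "0 Tr" or "2 Tr".
    private static final Pattern TR = Pattern.compile("(?<![\\w./])\\d+\\s+Tr(?!\\w)");
    // A standalone BT operator (not part of a name such as /BT or a longer token).
    private static final Pattern BT = Pattern.compile("(?<![\\w/])BT(?!\\w)");

    /** Returns the content with every text object switched to rendering mode 3. */
    static String makeTextInvisible(String content) {
        // Step 1: overwrite any explicit rendering mode with 3 (invisible).
        String forced = TR.matcher(content).replaceAll("3 Tr");
        // Step 2: set mode 3 at the start of every text object.
        return BT.matcher(forced).replaceAll("BT 3 Tr");
    }

    public static void main(String[] args) {
        String in = "q BT /F1 12 Tf 0 Tr 72 700 Td (Hello) Tj ET Q";
        System.out.println(makeTextInvisible(in));
    }
}
```

The same pass would have to be repeated for each Form XObject stream referenced by the page, which is the part that makes the operation feel complex.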

Here is my example file

The following code generates the PDF (the example file above):

    Document document = new Document(new Rectangle(width, height));
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(filename));
    document.open();

    PdfContentByte picCanvas = null;
    PdfContentByte txtCanvas = null;
    if (isUnderPic) {
        txtCanvas = writer.getDirectContentUnder();
        picCanvas = writer.getDirectContent();
    } else {
        txtCanvas = writer.getDirectContent();
        picCanvas = writer.getDirectContentUnder();
    }
    BaseFont bf = null;
    if (null != pageList) {

        int[] dpi = { 0, 0 };
        if (dpiType == 1) {
            dpi[0] = 300;
            dpi[1] = 300;
        } else if (dpiType == 2) {
            dpi[0] = 600;
            dpi[1] = 600;
        }

        for (int i = 0; i < pageList.size(); i++) {
            PDFPage page = pageList.get(i);
            Image pageImage = null;
            if (pdfType == 3) {
                pageImage = Image.getInstance(page.getBinImage());
            } else {
                pageImage = Image.getInstance(page.getOriImage());
            }
            if (pageImage.getWidth() > 0) {
                pageImage.scaleAbsolute(page.getWidth(), page.getHeight());
            }
            pageImage.setAbsolutePosition(0, 0);
            picCanvas.addImage(pageImage);

            if (pdfType == 2 || pdfType == 3) {
                for (PageElement ele : page.getElementList()) {
                    if (ele.getType().equals(PDFConstant.ElementType.PDF_ELEMENT_CHAR)) {
                        txtCanvas.beginText();
                        if (isColor) {
                            txtCanvas.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_FILL);
                            txtCanvas.setColorFill(BaseColor.RED);
                        } else {
                            txtCanvas.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);
                        }

                        String font = ele.getFont();
                        try {
                            bf = fonts.get(font);
                            if (null == bf) {
                                bf = BaseFont.createFont(font, "UniGB-UCS2-H", BaseFont.NOT_EMBEDDED);
                                fonts.put(font, bf);
                            }
                        } catch (Exception e) {
                            bf = BaseFont.createFont("STSong-Light", "UniGB-UCS2-H", BaseFont.NOT_EMBEDDED);
                            fonts.put(font, bf);
                        }
                        txtCanvas.setFontAndSize(bf, ele.getFontSize());
                        txtCanvas.setTextMatrix(ele.getPageX(), ele.getPageY(page.getRcInPage()));
                        txtCanvas.showText(ele.getCode());
                        txtCanvas.endText();
                    }
                }
            }

            if (StringUtils.isNotBlank(cutPath)) {
                for (PageElement ele : page.getElementList()) {
                    if (ele.getType().equals(PDFConstant.ElementType.PDF_ELEMENT_PIC) && StringUtils.isNotBlank(ele.getCutPicSrc())) {
                        ImageTools.cutPic(ele.getRcInImage(), page.getOriImage(), ele.getCutPicSrc(), dpi);
                    }
                }
            }
            if (pdfType == 3) {
                logger.debug("pdfType == 3");
                for (PageElement ele : page.getElementList()) {
                    if (ele.getType().equals(PDFConstant.ElementType.PDF_ELEMENT_PIC) && StringUtils.isNotBlank(ele.getCutPicSrc())) {
                        if (new File(ele.getCutPicSrc()).exists()) {
                            Image cutCover = Image.getInstance(ImageTools.drawImage((int) ele.getWidth(), (int) ele.getHeight()));
                            if (cutCover.getWidth() > 0) {
                                cutCover.scaleAbsolute(ele.getWidth(), ele.getHeight());
                            }
                            cutCover.setAbsolutePosition(ele.getPageX(), ele.getPageY(page.getRcInPage()));
                            picCanvas.addImage(cutCover);
                            Image pic = Image.getInstance(ele.getCutPicSrc());
                            if (pic.getWidth() > 0) {
                                pic.scaleAbsolute(ele.getWidth(), ele.getHeight());
                            }
                            pic.setAbsolutePosition(ele.getPageX(), ele.getPageY(page.getRcInPage()));
                            picCanvas.addImage(pic);
                        }
                    }
                }
            }
            if (i + 1 < pageList.size()) {
                document.setPageSize(new Rectangle(pageList.get(i + 1).getWidth(), pageList.get(i + 1).getHeight()));
            } else {
                document.setPageSize(new Rectangle(pageList.get(i).getWidth(), pageList.get(i).getHeight()));
            }
            document.newPage();
        }
    }
    document.close();

Upvotes: 1

Views: 1329

Answers (1)

Bruno Lowagie

Reputation: 77606

I've taken a look at your PDF and I see that it is a scanned image. The text isn't really text: it is part of an image. Your question is invalid because it assumes that the text consists of vector data (defined using PDF syntax, such as BT and ET). In reality, the text is a bunch of pixels, and no pixel knows whether it belongs to a text glyph or to the rest of the image. In short: you're using the wrong approach. You are trying to solve the problem with PDF software, whereas you should be using a tool that manipulates raster images.

This is the image I extracted from the PDF:

[Image: the scanned page extracted from the PDF]

The OP claims that there are two layers: one with an image, one with text. That may very well be true, but the image also contains rasterized text and it is impossible to remove that text from the image by changing the PDF syntax.

You may be able to cover the text if you know the coordinates, but that will largely depend on the accuracy of the OCR operation.
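
As a rough sketch of the coordinate step involved in covering text: the helper below converts an OCR bounding box into a rectangle in PDF user space. It assumes the OCR box is given in pixels with the origin at the top-left of the scanned image, that the page is unrotated with its origin at the bottom-left, and that the scan resolution (dpi) is known. The class and method names are hypothetical.

```java
class OcrBoxToPdfRect {
    /**
     * Converts a pixel box (top-left origin) to {x, y, width, height}
     * in PDF points (1/72 inch, bottom-left origin).
     */
    static float[] toPdfRect(float leftPx, float topPx, float widthPx, float heightPx,
                             float dpi, float pageHeightPt) {
        float x = leftPx * 72f / dpi;
        float w = widthPx * 72f / dpi;
        float h = heightPx * 72f / dpi;
        // Flip the vertical axis: PDF y grows upward from the bottom edge.
        float y = pageHeightPt - (topPx * 72f / dpi) - h;
        return new float[] { x, y, w, h };
    }

    public static void main(String[] args) {
        // A 150 x 75 px box at (300, 600) on a 300 dpi scan of a 792 pt high page.
        float[] r = toPdfRect(300, 600, 150, 75, 300, 792);
        System.out.println(r[0] + " " + r[1] + " " + r[2] + " " + r[3]);
    }
}
```

With iText 5, the resulting values could be passed to PdfContentByte.rectangle(x, y, w, h) on the stamper's over-content, followed by fill() with an opaque fill color. How well the rectangle hides the text still depends on how accurate the OCR boxes are.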

If your requirement is not to cover the text in the image, but the text of the vector layer, it's sufficient to draw the image after the vector text in the page content. If the image is opaque, it will cover all the text. This is done in the RepeatImage example:

    PdfReader reader = new PdfReader(src);
    // We assume that there's a single large picture on the first page
    PdfDictionary page = reader.getPageN(1);
    PdfDictionary resources = page.getAsDict(PdfName.RESOURCES);
    PdfDictionary xobjects = resources.getAsDict(PdfName.XOBJECT);
    PdfName imgName = xobjects.getKeys().iterator().next();
    Image img = Image.getInstance((PRIndirectReference)xobjects.getAsIndirectObject(imgName));
    img.setAbsolutePosition(0, 0);
    img.scaleAbsolute(reader.getPageSize(1));
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.getOverContent(1).addImage(img);
    stamper.close();
    reader.close();

Take a look at the resulting PDF; now you can still select the vector text, but it's no longer visible.

Upvotes: 2
