drunkenfist

Reputation: 3036

Editing content in pdf using PDFBox removes last line from pdf

I'm trying to edit some contents of a pdf using PDFBox in Java. The problem is, whenever I edit any string in the pdf, and try to open it using Adobe Reader, the last line does not appear in the newly rendered pdf.

When I open the rendered pdf directly in the browser, I can see the last line, but it appears to be encoded differently. I'm using the following code to edit the contents of the pdf:

PDDocument doc = PDDocument.load(FileName);
PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List<Object> tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
    Object next = tokens.get(j);
    if (next instanceof PDFOperator) {
        PDFOperator op = (PDFOperator) next;
        if (op.getOperation().equals("Tj")) {
            COSString previous = (COSString) tokens.get(j - 1);
            String string = previous.getString();

            string = string.replace("@ordnum&", (null != data.getOrderNumber()?data.getOrderNumber():""));
            string = string.replace("@shipid&", (null != data.getShipmentId()?data.getShipmentId():""));
            string = string.replace("@customer&", (null != data.getCustomerNumber()?data.getCustomerNumber():""));
            string = string.replace("@fromname&", (null != data.getFromName()?data.getFromName():""));

            tokens.set(j - 1, new COSString(string.trim()));
        }
    }
}

Editing the pdf removes the line which says "Have questions? ...". What is the problem here? Am I doing something wrong?

Thanks.

Upvotes: 3

Views: 1871

Answers (1)

mkl

Reputation: 95918

Why that last line becomes invalid

First of all, you have to be aware that there are two fundamentally different situations for strings in PDF:

  • outside content streams, e.g. author and keywords for the document properties, and
  • inside content streams representing sequences of glyphs from some font to be drawn.

The former type is encoded using either PDFDocEncoding (akin to Latin1) or UTF-16BE with a leading byte-order marker. The method COSString.getString and the constructor COSString(String) are designed for this kind of string.

The latter type is encoded using the encoding defined for the PDF font this string is to be rendered with. This may be some standardized encoding like WinAnsiEncoding (akin to Latin1) or UniGB-UTF16-H (Unicode (UTF-16BE) encoding for the Adobe-GB1 character collection). But it may also be some custom single- or multi-byte encoding. Neither the standardized nor the custom multi-byte encodings have a byte-order marker.

In the page content stream in your PDF most strings use WinAnsiEncoding (because that is the encoding of their font). Because WinAnsiEncoding and PDFDocEncoding are very similar, the PDFDocEncoding COSString method and constructor you use work quite fine for them.

That last line, though, is encoded using Identity-H, which is the horizontal identity mapping for 2-byte CIDs, i.e. a two-byte encoding that directly references a character ID in the font program and is meaningless without that font program.

As this string does not start with a byte order mark, COSString.getString assumes it to use the single-byte encoding PDFDocEncoding and so creates two Java string characters for each original two-byte PDF string character. As the character values for some of these characters are outside the actually valid PDFDocEncoding range, the constructor COSString(String) creates a PDF string in which each of the intermediate Java characters is represented using one two-byte UTF-16BE character; furthermore a byte-order marker is added.

Thus, the original PDF string (in hexadecimal writing)

002b004400590048000300540058004800560057004c0052005100560022000300260052
005100570044004600570003005800560003004400570003004b0057005700530056001d
00120012005a005a005a005600110046004c0057005500580056004f0044005100480011
004600520050001200460052005100570044004600570010005800560012

after your edit becomes

FEFF002B0000004400000059000000480000000300000054000000580000004800000056
000000570000004C00000052000000510000005600000022000000030000002600000052
000000510000005700000044000000460000005700000003000000580000005600000003
0000004400000057000000030000004B00000057000000570000005300000056000002DB
00000012000000120000005A0000005A0000005A0000005600000011000000460000004C
000000570000005500000058000000560000004F00000044000000510000004800000011
000000460000005200000050000000120000004600000052000000510000005700000044
0000004600000057000000100000005800000056
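The round trip described above can be reproduced with plain Java, without PDFBox. This is only a sketch under stated assumptions: decodeSingleByte and encodeUtf16WithBom are hypothetical stand-ins for what COSString.getString and COSString(String) do in this situation, and ISO-8859-1 is used in place of PDFDocEncoding (the two differ for some codes, which is presumably why the dump above shows 02DB where the original has 001d). The input is the first three two-byte codes of the original string.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Stand-in (assumption) for reading a PDF string that has no byte-order
    // marker as single-byte text: every byte becomes one Java char.
    static String decodeSingleByte(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Stand-in (assumption) for writing text that does not fit the
    // single-byte encoding: a byte-order marker followed by UTF-16BE.
    static byte[] encodeUtf16WithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE;
        out[1] = (byte) 0xFF;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    // Upper-case hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First three two-byte codes of the original string: 002b 0044 0059
        byte[] original = {0x00, 0x2B, 0x00, 0x44, 0x00, 0x59};

        // Same steps as in the question: decode, trim, re-encode.
        String decoded = decodeSingleByte(original).trim();
        System.out.println(hex(encodeUtf16WithBom(decoded)));
        // prints FEFF002B0000004400000059
    }
}
```

Each original two-byte code has turned into two Java characters, and each of those is written as two bytes again. The trim() call also strips the leading U+0000, because trim() removes every character up to U+0020 at both ends; this seems to be why the edited string starts with FEFF002B and has lost its final code.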

Depending on the PDF viewer this may have different effects. Your original line

[image: original line]

may e.g. become spread very wide

[image: line spread across page]

or vanish completely

[image: line vanished]

In a nutshell, therefore, if you need to edit a PDF like that, make sure that you only edit PDF strings with a Latin1-like encoding.

If you also need to edit differently encoded PDF strings, extract them as byte[] using the COSString method getBytes, edit this array in a way applicable to the encoding in question, and create a new COSString from the edited bytes using the constructor COSString(byte[]).
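A minimal sketch of such a byte-level edit, in plain Java: replaceBytes and its codeLength parameter are hypothetical names, not part of PDFBox. In real code the input array would come from COSString.getBytes and the result would go into the constructor COSString(byte[]). The byte values below are purely illustrative; which codes stand for which glyphs depends entirely on the font.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class ByteLevelEdit {
    // Replaces every occurrence of target in source with replacement.
    // Matches are only attempted at multiples of codeLength, so a two-byte
    // encoding is never matched across a code boundary.
    // Assumption: target.length is a multiple of codeLength.
    static byte[] replaceBytes(byte[] source, byte[] target,
                               byte[] replacement, int codeLength) {
        if (target.length == 0 || codeLength < 1) {
            return source.clone();
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < source.length) {
            int end = i + target.length;
            if (end <= source.length
                    && Arrays.equals(Arrays.copyOfRange(source, i, end), target)) {
                out.write(replacement, 0, replacement.length);
                i = end;
            } else {
                // Copy one code unchanged (or the remaining tail bytes).
                int n = Math.min(codeLength, source.length - i);
                out.write(source, i, n);
                i += n;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] source = {0x00, 0x2B, 0x00, 0x44, 0x00, 0x59};
        byte[] target = {0x00, 0x44};
        byte[] replacement = {0x00, 0x48, 0x00, 0x48};
        byte[] edited = replaceBytes(source, target, replacement, 2);
        System.out.println(Arrays.toString(edited));
        // prints [0, 43, 0, 72, 0, 72, 0, 89]
    }
}
```

Working on the raw bytes avoids the decode and re-encode steps entirely, so no byte-order marker is added and no code is widened.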

But even that is not a good idea at all.

Problems with editing streams like that in general

There are many other traps waiting for you when editing streams like that:

  • Instead of e.g.

    (@customer&) Tj
    

    your stream may contain

    (@cust) Tj
    (omer&) Tj
    

    or

    [(@cust) -6 (omer&) ] TJ 
    

    or even

    (omer&) Tj
    -62 0 Td
    (@cust) Tj
    

    Thus, replacement may suddenly stop working if a new template uses a slightly different representation.

  • Fonts may only be partially embedded. If the glyphs for the characters of your replacements are not included, they will be drawn as gaps.

  • Text drawing operations following the one you edited may count on the former one to have used a specific width. Your replacement can then destroy the former layout.

  • ...
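The first trap can be illustrated without any PDF library. The sketch below treats the string operands of the text-showing operators as a plain list of fragments in stream order; the class and method names are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;

class SplitPlaceholderDemo {
    // True if one single fragment contains the whole placeholder. This is
    // the check a per-operator replacement effectively relies on.
    static boolean foundInSingleFragment(List<String> fragments, String placeholder) {
        for (String fragment : fragments) {
            if (fragment.contains(placeholder)) {
                return true;
            }
        }
        return false;
    }

    // True if the fragments, concatenated in stream order, contain the placeholder.
    static boolean foundInJoinedFragments(List<String> fragments, String placeholder) {
        return String.join("", fragments).contains(placeholder);
    }

    public static void main(String[] args) {
        String placeholder = "@customer&";
        List<String> whole = Arrays.asList("@customer&");
        List<String> split = Arrays.asList("@cust", "omer&");
        List<String> reordered = Arrays.asList("omer&", "@cust");

        System.out.println(foundInSingleFragment(whole, placeholder));      // true
        System.out.println(foundInSingleFragment(split, placeholder));      // false
        System.out.println(foundInJoinedFragments(split, placeholder));     // true
        System.out.println(foundInJoinedFragments(reordered, placeholder)); // false
    }
}
```

Joining fragments helps with the split cases, but not when the stream draws the pieces in a different order than they appear on the page. Handling that would require taking the text positions into account.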

In essence, properly editing content streams in generic documents is very difficult.

What else you can do

Instead of content placeholders like your @customer& you can use AcroForm form fields.

Form fields have names and can be recognized by them. Filling them in does not change anything in the content.

If you don't want people afterwards to edit your PDF form fields, you can mark them as read-only or even flatten them into the content.

Upvotes: 4
