Apache PDFBox Form Fill TrueType text spacing issue

Question

I'm using Apache PDFBox to fill a PDF Form. I'm using a TrueType font (not a default font) called 'Impact', pretty standard fare. In the template I have a field called "Title" that has the Impact font assigned. I use the code below to take that template and populate the field with a value that has several words in it.

The issue is that when you view the created PDF, there are large spaces between the words. If you open the PDF in Acrobat and click on the field, the text changes and the large spacing goes away. Editing the field in any way permanently corrects the issue, but I'm generating the forms so that they are NOT altered after the fact.

I've tried the same experiment with the default fonts (Helvetica in this case) and the above issue doesn't exist. I can create a blank form and add a field and set the custom font and duplicate the issue.

I've read that a similar issue was addressed in 2.0.0 (PDFBOX-2062), but that was about changing the font size, not about a custom font.

I am using PDFBox version 2.0.1.

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDTextField;

public class FillFormField {

    public static void main(String[] args) throws IOException {

        String formTemplate = "/BLANK.pdf";
        String outputPDF = "/FillFormField.pdf";

        // load the template document
        PDDocument pdfDocument = PDDocument.load(new File(formTemplate));

        // get the AcroForm from the document catalog
        PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();

        // as there might not be an AcroForm entry a null check is necessary
        if (acroForm != null)
        {
            PDTextField field = (PDTextField) acroForm.getField( "Title" );
            field.setValue("Low Mileage Beauty");
        }

        // Save and close the filled out form.
        pdfDocument.save(outputPDF);
        pdfDocument.close();

    }
}

mkl · Accepted Answer

The problem is due to a combination of two factors:

a quirk of PDFBox when writing text and
a non-conformant font object in the source PDF.

PDFBox Quirk

When writing text into a content stream, PDFBox translates each Unicode codepoint into a glyph name and looks up that name in a map generated by inverting the font encoding.

The font encoding in the case at hand is MacRomanEncoding. In that encoding (and similarly in WinAnsiEncoding) there are two codes mapped to the name space, cf. Annex D.2 of the PDF specification ISO 32000-1, one given in the table:

          CHAR CODE (OCTAL)
CHAR NAME  STD MAC WIN PDF
...
     space 040 040 040 040
...

and one in footnote 6:

The SPACE character shall also be encoded as 312 in MacRomanEncoding and as 240 in WinAnsiEncoding. This duplicate code shall signify a nonbreaking space; it shall be typographically the same as (U+003A) SPACE.

The inverted font encoding can only have one value for the name space, which by chance happens to be the octal 312 (= decimal 202). PDFBox therefore writes every space in the field value using the code of the nonbreaking space.
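The effect of inverting an encoding with a duplicate name can be sketched in plain Java. This is not PDFBox's actual code: the class and method names are hypothetical, the map holds only a three-entry excerpt of MacRomanEncoding, and the insertion order is an assumption chosen so that the later code wins, as observed here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InvertedEncodingSketch {

    // Builds a name -> code map from a code -> name map.
    // When two codes share a name, the one processed last overwrites the earlier one.
    static Map<String, Integer> invert(Map<Integer, String> codeToName) {
        Map<String, Integer> nameToCode = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> entry : codeToName.entrySet()) {
            nameToCode.put(entry.getValue(), entry.getKey());
        }
        return nameToCode;
    }

    // Tiny excerpt of MacRomanEncoding (codes written as Java octal literals).
    static Map<Integer, String> macRomanExcerpt() {
        Map<Integer, String> codeToName = new LinkedHashMap<>();
        codeToName.put(0040, "space"); // regular space from the table
        codeToName.put(0101, "A");
        codeToName.put(0312, "space"); // duplicate nonbreaking space from footnote 6
        return codeToName;
    }

    public static void main(String[] args) {
        Map<String, Integer> inverted = invert(macRomanExcerpt());
        // prints "space -> octal 312"
        System.out.println("space -> octal " + Integer.toOctalString(inverted.get("space")));
    }
}
```

Whichever code is processed last determines the result, which is why the answer calls the outcome a matter of chance.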

As the two space glyphs are expected to be typographically the same, this quirk should be harmless. But:

Non-conformant font in the PDF

The font Impact in the PDF is defined with width 176 for the normal space glyph and 750 for the nonbreaking space glyph. Thus, the two glyphs differ markedly in their typography.

As Impact in the PDF is defined to have MacRomanEncoding (with minor variations of no interest here), though, those two glyphs are required ("shall" indicates a requirement) to be typographically the same, cf. the footnote quoted above.
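To put the two widths in perspective, here is a small worked calculation. It assumes the widths are expressed in the usual 1/1000 text space units of a simple font and picks a font size of 12 pt purely for illustration; the actual size of the Title field is not stated in the question. The class and method names are hypothetical.

```java
public class SpaceWidthSketch {

    // Horizontal advance in points for a glyph whose width is given
    // in 1/1000 units, drawn at the given font size.
    static double advance(int glyphWidth, double fontSize) {
        return glyphWidth / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        double fontSize = 12.0; // assumed font size, not taken from the question

        double normalSpace = advance(176, fontSize);      // width of code 040
        double nonbreakingSpace = advance(750, fontSize); // width of code 312

        System.out.println("normal space advance:      " + normalSpace + " pt");
        System.out.println("nonbreaking space advance: " + nonbreakingSpace + " pt");
        System.out.println("ratio: " + (nonbreakingSpace / normalSpace));
    }
}
```

At that assumed size the normal space advances about 2.1 pt and the nonbreaking one 9 pt, so each gap between words comes out more than four times as wide as intended.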

How to deal with this

A first, quick option, as @Tilman already recommended in a comment, would be to set acroForm.setNeedAppearances(true).

This sets a flag that indicates to a PDF viewer that it shall re-create appearance content streams. This might not work with some previewers, though.
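Applied to the code from the question, this option might look like the following sketch. It assumes PDFBox 2.0.x on the classpath and reuses the file and field names from the question; the class name and the extra null check on the field are additions for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class FillFormNeedAppearances {

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the document even if filling or saving fails
        try (PDDocument pdfDocument = PDDocument.load(new File("/BLANK.pdf"))) {
            PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();

            if (acroForm != null) {
                // ask the viewer to rebuild the field appearances when it opens the file
                acroForm.setNeedAppearances(true);

                PDField field = acroForm.getField("Title");
                if (field != null) {
                    field.setValue("Low Mileage Beauty");
                }
            }

            pdfDocument.save("/FillFormField.pdf");
        }
    }
}
```

Since the rendering is then left to the viewer, the result depends on the viewer honoring the flag.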

The next option would be to fix the source PDF which contains the non-conformant font definition.

And eventually PDFBox might want to get rid of this quirk. While it typographically should make no difference which space variant is drawn, choosing the non-breaking variant is tempting fate.
