David Hedley

Reputation: 364

Ghostscript re-encoding embedded font

I am using Ghostscript (9.14) with the pdfwrite driver to "clean up" PDFs prior to distribution. While it works very well in general, I have noticed that it frequently re-encodes embedded fonts, which often prevents sensible text extraction for searching etc.

An example file before ghostscript processing is here: http://download.vistair.com/ghostscript/in.pdf and the result after processing with ghostscript is here: http://download.vistair.com/ghostscript/out.pdf

Sensible text extraction is possible with the input file, but not with the output file.

Looking in the PDF, in the input file we have:

obj 9 0
 Type: /Font
 Referencing: 12 0 R, 14 0 R

  <<
    /BaseFont /GCCBBY+TT187t00
    /Encoding 12 0 R
    /FirstChar 1
    /FontDescriptor 14 0 R
    /LastChar 41
    /Subtype /TrueType
    /Type /Font
    /Widths [352 684 633 973 596 427 636 636 636 636 751 632 684 616 695 787 989 421 748 686 575 601 521 633 521 394 274 607 633 623 623 274 352 364 698 623 623 592 592 592 636]
  >>


obj 12 0
 Type: /Encoding
 Referencing:

  <<
    /BaseEncoding /WinAnsiEncoding
    /Differences [1/space/S/u/m/e/r/two/zero/one/four/H/E/A/T/R/O/W/I/N/B/F/a/c/h/s/t/i/o/n/p/b/l/f/period/C/d/g/y/v/k/endash]
    /Type /Encoding
  >>
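
For illustration, the /Differences array above can be expanded into a code-to-glyph-name table. This is a minimal Python sketch: the function name `expand_differences` and the sample byte sequence are hypothetical, since the page content stream is not shown in the question.

```python
import re

# Body of the /Differences array from object 12 0 in the input file.
DIFFERENCES = (
    "1/space/S/u/m/e/r/two/zero/one/four/H/E/A/T/R/O/W/I/N/B/F/a/c/h/s/t"
    "/i/o/n/p/b/l/f/period/C/d/g/y/v/k/endash"
)

def expand_differences(differences: str) -> dict[int, str]:
    """Expand a /Differences array body into {character code: glyph name}."""
    mapping: dict[int, str] = {}
    code = None
    # Tokens are either integers (a new starting code) or /names (glyph names).
    for token in re.findall(r"/[^\s/\[\]]+|\d+", differences):
        if token.startswith("/"):
            if code is None:
                raise ValueError("glyph name appears before any starting code")
            mapping[code] = token[1:]
            code += 1  # each name takes the next consecutive code
        else:
            code = int(token)
    return mapping

mapping = expand_differences(DIFFERENCES)

# Hypothetical byte sequence from a text-showing operator (an assumption,
# not taken from the actual file).
sample_codes = [2, 3, 4, 4, 5, 6]
glyphs = [mapping[c] for c in sample_codes]
print(glyphs)
```

With this table an extractor can turn codes into glyph names and from there into text. Without the /Encoding entry, and with no /ToUnicode map, it only has the raw codes.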

In the ghostscript-processed file this has become:

obj 8 0
 Type: /Font
 Referencing: 9 0 R

  <<
    /BaseFont /OWPYKO+TT187t00
    /FontDescriptor 9 0 R
    /Type /Font
    /FirstChar 2
    /LastChar 6
    /Widths [ 684 633 973 596 427]
    /Subtype /TrueType
  >>

So the font encoding information has been lost and the text is no longer extractable.

Is there a way to stop ghostscript re-encoding existing embedded fonts (or at least preserve any existing font encoding)?

Upvotes: 2

Views: 1431

Answers (1)

KenS

Reputation: 31141

To be blunt, no. It's a TrueType font, and they always get converted to a symbolic font (for complex reasons to do with the way that Ghostscript works).

In the past we did emit an Encoding, because Acrobat will use an Encoding for a TrueType font (even for a symbolic font, which it should not do). However, the PDF spec is quite clear that symbolic fonts should not specify an Encoding, and it reached the point where doing so was creating more problems than it solved, so we stopped doing it.
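
As a rough illustration of what "symbolic" means here: in the PDF font descriptor, the /Flags integer carries a Symbolic bit (bit 3, value 4) and a Nonsymbolic bit (bit 6, value 32), and exactly one of the two should be set. The sketch below assumes you already have the /Flags value in hand. The FontDescriptor objects are not shown in the question, so the sample values are illustrative only, and `describe_flags` is a hypothetical helper name.

```python
# Font descriptor /Flags bits (PDF numbers bits from 1, low-order first).
SYMBOLIC = 1 << 2      # bit 3, value 4
NONSYMBOLIC = 1 << 5   # bit 6, value 32

def describe_flags(flags: int) -> str:
    """Classify a font descriptor /Flags value as symbolic or nonsymbolic."""
    symbolic = bool(flags & SYMBOLIC)
    nonsymbolic = bool(flags & NONSYMBOLIC)
    if symbolic == nonsymbolic:
        # Both set or both clear: exactly one of the two bits should be set.
        return "invalid"
    return "symbolic" if symbolic else "nonsymbolic"

# Illustrative values only, not read from in.pdf or out.pdf.
for flags in (4, 32):
    print(flags, describe_flags(flags))
```

Checking this flag on the font descriptors of the input and output files would show whether the font changed from nonsymbolic to symbolic during processing.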

Upvotes: 1
