Wazime

Reputation: 1678

Ghostscript output PDF: text cannot be copied

I am using TCPDF in order to create PDF files.

Because TCPDF has a bug in the font subsetting (link to bug),
I use the following Ghostscript command to subset fonts in the TCPDF-created PDF file:

gswin64c.exe -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
    -dPDFSETTINGS=/prepress -dUseFlateCompression=false -dEmbedAllFonts=true \
    -dSubsetFonts=true -sOutputFile="out.pdf" "input.pdf"

It works well and reduces the file size. But when I extract the text from the PDF (with poppler's pdftotext), or open the file in a PDF viewer and select text, I get gibberish for the UTF-8 fonts.

To reproduce it, here is the file before Ghostscript and the file after Ghostscript.

If you open the files in Adobe Reader, copy the text, and paste it somewhere else, you can see that the text copies correctly from the "before GS" file. From the second file you get gibberish unless you copy English characters (the files are in Hebrew).

Other than that the file looks great.

Do you have any idea on how to preserve the UTF8 fonts in Ghostscript?

Upvotes: 1

Views: 3966

Answers (1)

KenS

Reputation: 31139

Yes, don't subset the fonts. Subsetting the fonts causes them to be re-encoded. Because your fonts don't have a ToUnicode CMap, copy/paste only works by heuristics (i.e. the character codes have to be meaningful). In your case the character codes are, or appear to be, Unicode, so you are in luck and the heuristics work.

Once you subset the fonts, Ghostscript re-encodes them. So the character codes are no longer Unicode. In the absence of a ToUnicode CMap, the copy/paste no longer works.
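To illustrate why re-encoding breaks copy/paste, here is a toy Python model. It is not Ghostscript's actual algorithm: the function names and the "new codes in order of first use" scheme are assumptions made purely for illustration. It only shows that once codes stop being Unicode values, a viewer guessing from the codes fails, while an explicit code-to-Unicode map (the role a ToUnicode CMap plays) would still recover the text.

```python
def subset_reencode(text):
    """Toy re-encoding: give each distinct glyph a new small code,
    in order of first use (code 0 is left for .notdef)."""
    new_code = {}
    for ch in text:
        if ch not in new_code:
            new_code[ch] = len(new_code) + 1
    codes = [new_code[ch] for ch in text]
    # Inverse mapping, standing in for a ToUnicode CMap.
    to_unicode = {code: ch for ch, code in new_code.items()}
    return codes, to_unicode


def extract_by_heuristic(codes):
    """What a viewer can do without a map: treat codes as Unicode."""
    return "".join(chr(c) for c in codes)


def extract_with_map(codes, to_unicode):
    """What a viewer can do when a code-to-Unicode map is present."""
    return "".join(to_unicode[c] for c in codes)


original = "שלום"
# Before subsetting: codes happen to equal the Unicode code points.
original_codes = [ord(ch) for ch in original]
# After subsetting: codes are arbitrary small integers.
subset_codes, to_unicode = subset_reencode(original)

print(extract_by_heuristic(original_codes) == original)        # heuristic works
print(extract_by_heuristic(subset_codes) == original)          # heuristic fails
print(extract_with_map(subset_codes, to_unicode) == original)  # map recovers text
```

In this model the subsetted file would only remain copyable if a mapping like `to_unicode` were written alongside the font, which is exactly what is missing here.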

The only way you can get this to work is to not re-encode the fonts, which means you cannot subset them using Ghostscript's pdfwrite device. In fact, because you are using CIDFonts with TrueType outlines, you can't avoid subsetting the fonts, so basically, this won't work.

Please bear in mind that Ghostscript's pdfwrite device is not intended as a tool for manipulating PDF files!

By the way, your PDF file has other problems. It scales a font (Tf operator) to 0, and it has a BBox for a Form where all the co-ordinates are 0 (and indeed the form has no content, so it is pointless). This is in addition to a CIDFont with no ToUnicode CMap. Perhaps you should consider a different tool for production of PDF files.
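As a rough way to check a file for the missing ToUnicode CMap, the following Python sketch counts font-related markers in the raw bytes of a PDF. The function name is hypothetical, and the approach is only a heuristic: it assumes the font dictionaries are stored uncompressed (which is typical for PDF 1.4 output, but not for files that use compressed object streams), so a count of zero is a hint rather than proof.

```python
import re


def count_font_markers(pdf_bytes):
    """Count CIDFont dictionaries and /ToUnicode entries in raw PDF bytes.

    Heuristic only: entries inside compressed object streams are not seen.
    """
    return {
        # /CIDFontType2 is a CIDFont with TrueType outlines, /CIDFontType0 with CFF.
        "cid_fonts": len(re.findall(rb"/CIDFontType[02]\b", pdf_bytes)),
        "to_unicode": len(re.findall(rb"/ToUnicode\b", pdf_bytes)),
    }


# Small synthetic fragments standing in for real font dictionaries.
without_map = b"<< /Type /Font /Subtype /CIDFontType2 /BaseFont /AAAAAA+Font >>"
with_map = b"<< /Type /Font /Subtype /Type0 /ToUnicode 12 0 R >>"

print(count_font_markers(without_map))
print(count_font_markers(with_map))
```

To run it on a real file, read the file in binary mode and pass the bytes, for example `count_font_markers(open("input.pdf", "rb").read())`. If CID fonts are reported but no /ToUnicode entries, text extraction is relying on the character codes alone.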

Upvotes: 2
