Anthony Kong

Reputation: 40624

Mismatched font issue when converting PDF to JPEG using ImageMagick on Ubuntu?

I am using this command to convert a PDF to a set of JPEG files:

convert -strip -quality 100 -alpha off \
        -density 165% -scene 1 tmp3GtW_h.pdf /tmp/a1.jpg

Here is the original PDF:

[screenshot of the original PDF]

The font is thinner and more akin to Helvetica.

Here is the outcome:

[screenshot of the converted JPEG]

The font in the output JPEG file is different and thicker.

The convert command shows this warning:

   **** Warning:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> Microsoft® PowerPoint® 2013 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

The version of convert is:

$ convert --version
Version: ImageMagick 6.8.9-7 Q16 x86_64 2014-12-30 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2014 ImageMagick Studio LLC
Features: DPC OpenMP
Delegates: jng jpeg png x xml zlib

Ghostscript version is:

$ gs --version
9.10

My questions are:

1) How can I resolve this issue?

2) How can I tell what font the PDF file is using?

3) How can I tell what fonts are available to convert and gs?

EDIT: Found an answer to question 2. Here is the outcome from the pdffonts command:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Intro Black Italic                   Type 1            WinAnsi          no  no  no     145  0
Intro Regular                        Type 1            WinAnsi          no  no  no     147  0
Intro Black Inline Caps              Type 1            WinAnsi          no  no  no     388  0
ABCDEE+Segoe UI                      TrueType          WinAnsi          yes yes no    2233  0
ABCDEE+Segoe UI,Italic               CID TrueType      Identity-H       yes yes yes   2607  0
ABCDEE+Segoe UI,Italic               TrueType          WinAnsi          yes yes no    2612  0
Intro Bold Italic                    Type 1            WinAnsi          no  no  no    3781  0

Upvotes: 4

Views: 3208

Answers (1)

Kurt Pfeifle

Reputation: 90193

If you want to know all relevant details about the fonts used by a PDF document, use

pdffonts the.pdf

The emb column shows yes or no to indicate whether each font is embedded.
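As a rough sketch, the non-embedded fonts can be filtered out of that listing with a small awk helper. The function name list_unembedded_fonts is made up for illustration, and the parsing assumes the fixed-width layout shown in the question (a very long font name could push the columns out of alignment and break it):

```shell
# Print the names of fonts that pdffonts reports as not embedded.
# Reads pdffonts output on stdin. Assumes the fixed-width column layout
# shown in the question; this is an illustration, not a robust parser.
list_unembedded_fonts() {
  awk '
    NR == 1 {                       # header row: remember where columns start
      type_col = index($0, "type")
      emb_col  = index($0, "emb")
      next
    }
    NR > 2 && substr($0, emb_col, 2) == "no" {
      name = substr($0, 1, type_col - 1)
      sub(/ +$/, "", name)          # trim the padding after the font name
      print name
    }
  '
}

# Usage (the.pdf is a placeholder file name):
#   pdffonts the.pdf | list_unembedded_fonts
```

Font names can contain spaces, so the helper cuts the name by column position instead of splitting on whitespace.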

If a font is NOT embedded, you get what you are seeing: the PDF renderer does not find the font in the file, so it uses a substitute font:

  1. If you are lucky, it finds one on the local system with the same or a similar name, and the rendered pages will look the way they did for the producer of the PDF (who must have had a font with that name on their system).
  2. If you are less lucky, it uses a substitute font that is not really suitable and doesn't look good or "right".
  3. If you are very unlucky, the substitution doesn't work at all and the page looks like garbage.

Either way, the document will most likely look different from viewer to viewer and from system to system, because each viewer uses a different algorithm to substitute missing fonts.

The pdffonts command also has a -subst parameter, so

pdffonts -subst the.pdf

will report which substitute fonts may be used. Since Poppler, the library pdffonts is based on, uses FreeType as its font engine, the reported substitutions will likely hold for every viewer that also uses FreeType.

Acrobat for example does NOT use FreeType, but its own font rendering engine. So in Adobe Reader you'll likely get different substitution fonts.


Ghostscript:

The command

gs -h

will report (amongst other things) which directories it will use as its path to search for fonts.

Any Ghostscript command you run can be extended with

-sFONTPATH=/path/to/dir:/path/to/other/dir

to tell Ghostscript to look in other directories for needed fonts for the duration of the current command.

ImageMagick:

This command

convert -list font

will report all fonts which ImageMagick has found on the system.


Update: (after update to question)

So, very clearly, four different Intro fonts are not embedded in the PDF. This is a very uncommon font, certainly not among the top 200 used in PDFs worldwide (I should know: I've harvested 1,000,000 PDFs from the web and am currently building a statistical database of their various properties, and there isn't a single Intro font in it).

Whoever created that PDF, or whichever software did, clearly didn't have much of a clue about document processing: every other system, user or application that has to open, view or process the document will see the pages that use these fonts very differently from what its creator saw.

In order to process this PDF into images you should not rely on ImageMagick, but run Ghostscript directly:

  1. Locate the directories where the four Intro fonts are to be found.
  2. Run the Ghostscript command with the -sFONTPATH=... parameter as explained above.
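A minimal sketch of such a direct Ghostscript call follows. The wrapper name pdf_to_jpeg and the GS override variable are made up for illustration, and -r165 and -dJPEGQ=100 are only my approximation of the -density and -quality settings in the question's convert command:

```shell
# Render every page of a PDF to JPEG with Ghostscript directly.
# $1 = input PDF
# $2 = colon-separated list of font directories (passed to -sFONTPATH)
# $3 = output file pattern; %d is replaced by the page number
# GS is a hypothetical override for the Ghostscript binary (defaults to gs).
pdf_to_jpeg() {
  "${GS:-gs}" -dNOPAUSE -dBATCH -sDEVICE=jpeg -r165 -dJPEGQ=100 \
    -sFONTPATH="$2" -sOutputFile="$3" "$1"
}

# Usage (the font directories are placeholders):
#   pdf_to_jpeg tmp3GtW_h.pdf /path/to/dir:/path/to/other/dir /tmp/a%d.jpg
```

Ghostscript will probably still print the XREF warning for this file, and the substitution only improves if the directories really contain fonts whose names match those in the PDF.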

Let me re-iterate:

  1. You cannot force or persuade convert to use a particular font when it renders PDF pages to raster images.
  2. This is because ImageMagick never gets to see the PDF itself. What ImageMagick receives is a raster image that has been produced by Ghostscript.
  3. Once Ghostscript is done with its work, the accident has already happened, and convert cannot insert any 'font' into the raster data after the fact.
  4. The fonts that convert can use are only for its own drawing, writing, captioning and annotating operations.
  5. So you have to run Ghostscript directly, and supply the -sFONTPATH=... argument.
  6. You have to find out yourself, where on your system that Intro font family is. I cannot do that for you, sorry.
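One possible way to go hunting for the font files is sketched below. The helper name find_font_files, the file extensions and the example directories are all assumptions on my part; the Intro fonts may be stored under different names or not be installed at all:

```shell
# List font files whose name contains a given fragment (case-insensitive).
# $1 = name fragment, remaining arguments = directories to search.
# Only .ttf, .otf and .pfb files are considered in this sketch.
find_font_files() {
  pattern=$1
  shift
  find "$@" -type f \( -iname "*${pattern}*.ttf" \
                    -o -iname "*${pattern}*.otf" \
                    -o -iname "*${pattern}*.pfb" \) 2>/dev/null
}

# Usage (typical Ubuntu font locations, given only as examples):
#   find_font_files intro /usr/share/fonts "$HOME/.fonts"
```

The directories that contain the hits are what goes into -sFONTPATH. If fontconfig is installed, fc-list | grep -i intro may give the same information more quickly.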

Running convert -verbose will give you some insight into exactly how ImageMagick employs Ghostscript as its 'delegate' for PDF input processing, and which command line parameters it uses....

Upvotes: 5
