Reputation: 488
I am having a problem when converting PDF files to images using ImageMagick or Ghostscript: all accented characters disappear from the converted image. I found a couple of people with the same problem, and apparently updating the ImageMagick package and Ghostscript fixed it for them, but not for me.
I used this PDF file in every test I ran: https://www.dropbox.com/s/3gso0sw1e1n8f9r/error-with-accents.pdf?dl=0
I have an Ubuntu 14.04.2 LTS server on Azure where I need ImageMagick to work. From the official repositories I have ImageMagick 6.7.7 and Ghostscript 9.10. Later, I tried upgrading them to fix the issue, so I now also have ImageMagick 6.8.9-10 installed in the /opt/imagemagick-6.8 folder, and I added the Ubuntu 15.04 repository so I could install Ghostscript 9.15 directly through apt-get. Neither upgrade fixed the problem for me.
Here are my latest attempts on the Ubuntu 14.04 server:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty
$ /opt/imagemagick-6.8/bin/convert -version
Version: ImageMagick 6.8.9-10 Q16 x86_64 2015-07-30 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2014 ImageMagick Studio LLC
Features: DPC OpenMP
Delegates: jng jpeg png x xml zlib
$ /opt/imagemagick-6.8/bin/convert -list configure |grep DELEGATES
DELEGATES mpeg jng jpeg png ps x xml zlib
$ /opt/imagemagick-6.8/bin/convert error-with-accents.pdf -verbose -alpha off -resample 150 -density 150 -quality '80' im-test.jpg
**** Warning: considering '0000000000 XXXXX n' as a free entry.
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.10.4 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
error-with-accents.pdf=>im-test.jpg PDF 595x794=>1240x1654 1240x1654+0+0 16-bit sRGB 172KB 0.440u 0:00.240
$ gs -v
GPL Ghostscript 9.15 (2014-09-22)
Copyright (C) 2014 Artifex Software, Inc. All rights reserved.
$ gs -dBATCH -dNOPAUSE -sDEVICE=jpeg -sOutputFile=gs-test.jpg error-with-accents.pdf
GPL Ghostscript 9.15 (2014-09-22)
Copyright (C) 2014 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: considering '0000000000 XXXXX n' as a free entry.
Processing pages 1 through 1.
Page 1
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.10.4 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
$ convert -version
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
$ convert -list configure |grep DELEGATES
DELEGATES bzlib djvu fftw fontconfig freetype jbig jpeg jng jp2 lcms2 lqr lzma openexr pango png rsvg tiff x11 xml wmf zlib
$ convert error-with-accents.pdf -verbose -alpha off -resample 150 -density 150 -quality '80' im-test-6.7.7.jpg
**** Warning: considering '0000000000 XXXXX n' as a free entry.
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.10.4 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
error-with-accents.pdf=>im-test-6.7.7.jpg PDF 595x794=>1240x1654 1240x1654+0+0 16-bit DirectClass 160KB 0.490u 0:00.279
All of them give the same result: the accented characters are missing from the image.
I am able to run Ghostscript and ImageMagick correctly on Mac OS, and according to this post the versions I have on Ubuntu should work. So I suspect it is something related to FreeType fonts, which I do not know how to fix. Any help?
Upvotes: 2
Views: 2239
Reputation: 3123
The problem still exists six years later. I was trying to convert a PDF to PNG images, and some of the accented characters were missing from the rendered text. I finally resolved it as follows. First, I used the "gs" command to convert a few pages from the PDF document:
gs -sDEVICE=pngalpha -o file-%03d.png -r300 document.pdf
Then I examined the output of gs and found complaints about missing fonts; in my case it was ArialMT. I then picked a font from my Linux system's TTF font directory and copied it into the Ghostscript font directory, explicitly substituting it for the missing font:
sudo cp /usr/share/fonts/TTF/DejaVuSans.ttf /usr/share/ghostscript/9.53.3/Resource/Font/ArialMT
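Reading through the gs output by hand can be sketched as a small filter. This is only a sketch: it assumes Ghostscript reports a fallback with a line of the form "Substituting font Helvetica for ArialMT.", which is what I have seen but may differ between versions, and list_substituted_fonts is a hypothetical helper name.

```shell
# Print the names of fonts that Ghostscript replaced with a fallback, one per line.
# ASSUMPTION: the log contains lines like "Substituting font <fallback> for <wanted>."
# Adjust the pattern if your Ghostscript version words the message differently.
list_substituted_fonts() {
    sed -n 's/^.*Substituting font .* for \(.*\)\.$/\1/p' | sort -u
}

# Usage (not run here): gs may write its messages to stdout or stderr, so merge both.
#   gs -sDEVICE=pngalpha -o file-%03d.png -r300 document.pdf 2>&1 | list_substituted_fonts
```

Each name it prints is a candidate for a file of the same name in the Ghostscript Resource/Font directory.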
The target name has no extension because Ghostscript font files have none. With the substitute in place, I was able to use the ImageMagick or GraphicsMagick "convert" command to convert the PDF document into PNG images:
for i in $(seq 0 143);do echo $i;gm convert -density 600 document.pdf[$i] -verbose -colorspace RGB -flatten ./png/$i.png ;done
I used a for loop with GraphicsMagick because I found it much faster than ImageMagick, and I converted page by page because converting large documents in one go caused memory problems.
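The loop above hard-codes the last page index (143). A possible variation, assuming pdfinfo from poppler-utils is installed and prints a "Pages:" line, reads the page count from the document instead. The function names page_count and convert_pages are made up for this sketch.

```shell
# Extract the page count from `pdfinfo` output read on stdin.
# ASSUMPTION: the output contains a line such as "Pages:          144".
page_count() {
    awk '$1 == "Pages:" { print $2; exit }'
}

# Convert a PDF one page at a time to keep memory use low.
# Page indexes passed to gm are zero-based, so they run from 0 to count-1.
convert_pages() {
    pdf=$1
    outdir=$2
    n=$(pdfinfo "$pdf" | page_count) || return 1
    [ -n "$n" ] || return 1
    i=0
    while [ "$i" -lt "$n" ]; do
        gm convert -density 600 "${pdf}[$i]" -verbose -colorspace RGB -flatten "$outdir/$i.png" || return 1
        i=$((i + 1))
    done
}

# Usage (not run here):
#   mkdir -p png && convert_pages document.pdf ./png
```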
Upvotes: 0
Reputation: 90203
The PDF document you are trying to process has been modified and re-saved very often: 455 times between 2010-03-06 and 2014-06-17.
You can verify that by running pdfinfo -meta error-with-accents.pdf.
I do not speak or read Portuguese, so I cannot recognize immediately if an accent is missing in an output image where one should be.
When I tried your command with IM v6.9.0-0 Q16 x86_64 2015-05-14 (using Ghostscript v9.16), I did not see any error.
Your PDF has all the fonts it uses embedded (see the emb column). This means that FreeType will not be employed to look for any replacement or substitute font:
$ pdffonts error-with-accents.pdf
name                       type       encoding         emb sub uni object ID
-------------------------- ---------- ---------------- --- --- --- ---------
RUXYWW+ConduitITC-Light    Type 1C    MacRoman         yes yes no      14  0
NOYZMG+Y2KNeophyte         TrueType   WinAnsi          yes yes yes     10  0
MVLYKX+ConduitITC-Medium   Type 1C    MacRoman         yes yes no      15  0
JDNVDM+ConduitITC-Bold     Type 1C    MacRoman         yes yes no      13  0
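For PDFs with many fonts, the emb column can be checked mechanically. The following is a rough sketch, assuming the column layout shown in the listing (the last five fields being emb, sub, uni and the two-part object ID); other pdffonts builds may print extra columns. non_embedded_fonts is a hypothetical helper name.

```shell
# Read `pdffonts` output on stdin and print the names of fonts that are NOT embedded.
# Fields are counted from the right because the type column can contain a space
# ("Type 1C"), which shifts the position of "emb" when counting from the left.
# ASSUMPTION: the last five fields are emb, sub, uni, object number, generation.
# Only the first word of a font name is printed if the name itself contains spaces.
non_embedded_fonts() {
    awk 'NR > 2 && NF >= 6 && $(NF - 4) == "no" { print $1 }'
}

# Usage (not run here):
#   pdffonts error-with-accents.pdf | non_embedded_fonts
```

No output means every listed font is embedded, as is the case for this file.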
In any case, you should concentrate on getting a version of Ghostscript that processes your PDF correctly, because ImageMagick does not do any PDF processing on its own; it relies on Ghostscript as its "delegate" to do so.
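To see quickly which Ghostscript a given machine has, a version comparison along these lines may help. It is a sketch only: 9.16 is simply the version that worked in my test, not a known minimum, and version_ge is a hypothetical helper.

```shell
# Return success (0) if dotted version $1 is greater than or equal to version $2.
# Components are compared numerically, left to right; missing components count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0; yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Usage (not run here): `gs --version` prints just the version number, e.g. 9.15.
#   if version_ge "$(gs --version)" 9.16; then echo "gs is 9.16 or newer"; fi
```

Since ImageMagick only hands the PDF to whichever gs its delegate configuration finds, it is worth running the check with the same PATH that convert sees.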
Upvotes: 2