Claudiga

Reputation: 437

Ghostscript converting pdf to text file, output is unreadable

I was trying to convert a PDF document into a text file. Everything works until I open the output file and see that it is unreadable; the characters appear to be in some Chinese font:

" 琀攀猀琀 "

This is my command line:

gswin64c.exe -ps2ascii -sDEVICE=txtwrite -sOutputFile=outputtext.txt test.pdf 

Am I doing something wrong?

Upvotes: 2

Views: 2240

Answers (1)

KenS

Reputation: 31141

You haven't posted the file, so it's not possible to be absolutely certain, however...

Almost certainly the text in your PDF file is not encoded using an ASCII encoding scheme (it possibly contains subset fonts), and the file does not contain a ToUnicode CMap for the font in question. Additionally, the glyph names are not standard names (or it's a TrueType font, which doesn't have named glyphs).

Without any of the above information, txtwrite has no way to know what the character codes represent, and so simply emits them verbatim.

Given that you are seeing Chinese glyphs, I would suspect that the original font is a CIDFont, probably a TrueType font, that is subset and has no ToUnicode CMap.

In this case, the only way to get the text out will be to use OCR.
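Before falling back to OCR, it can help to look at which character codes actually landed in the output file. The following is a minimal Python sketch, not part of the original diagnosis: it prints the code points of the garbled sample from the question and also re-reads the same 16-bit values with the opposite byte order. The helper names `code_points` and `swap_utf16_bytes` are hypothetical, and the byte-order check rests on the assumption that the output might be UTF-16 text opened with the wrong encoding, which is worth ruling out but is not confirmed for this file.

```python
def code_points(text: str) -> list[str]:
    """Return the Unicode code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


def swap_utf16_bytes(text: str) -> str:
    """Re-decode text as if its UTF-16 byte order had been read backwards.

    Assumption: the garbling could come from a byte-order mix-up in the
    viewer rather than from the PDF's fonts. Unpaired or invalid units
    are replaced instead of raising an error.
    """
    return text.encode("utf-16-be").decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    sample = "琀攀猀琀"  # garbled sample copied from the question
    print(code_points(sample))       # the raw codes that were emitted
    print(swap_utf16_bytes(sample))  # same codes with bytes swapped
```

If the swapped string comes out as readable text, the problem is more likely the encoding used to open the file than a missing ToUnicode CMap. If it stays unreadable, the explanation above stands and OCR remains the practical option.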

Upvotes: 2
