Reputation: 437
I was trying to convert a pdf document into text file. everything works until i open the output file to see its unreadable the characters are in some Chinese font
" 琀攀猀琀 "
this is my command line
gswin64c.exe -ps2ascii -sDEVICE=txtwrite -sOutputFile=outputtext.txt test.pdf
im i doing something wrong?
Upvotes: 2
Views: 2240
Reputation: 31141
You haven't posted the file, so its not possible to be absolutely certain, however....
Almost certainly the text in your PDF file is not encoded using an ASCII encoding scheme (possibly contains sunset fonts), and does not contain a ToUnicode CMap for the font in question. Additionally the glyph names are not standard names (or its a TrueType font, which don't have named glyphs).
Without any of the above information txtwrite doesn't have any clue what the character codes represent, and so simply emits them verbatim.
Given that you are seeing Chinese glyphs, I would suspect that the original font is a CIDFont, probably a TrueType font, subset and has no ToUnicode CMap.
In this case, the only way to get the text out will be to use OCR.
Upvotes: 2