Reputation: 57
I have a pdf which is produced by Ghostscript 8.15. I need to process this pdf from my software which extract font names from pdf file and then perform some operations. But when I extract font names from this pdf file, these names are not as same as should be. For example: Original font name is 'NOORIN05', but pdf file contains 'TTE25A5F90t00'. How can decode these font names to original names. All fonts are TTF.
NOTE: Why I need to extract the fonts. Actually there is a software named InPage which was most famous in India and Pakistan to write documents in Urdu language, because before the unicode support in word processor, this was the only solution to type Urdu in computer. Due to complexity of Urdu language, this software uses 89 fonts files named NOORIN01 TO NOORIN89. The reason of using too many font files is to contain all the Urdu ligatures which are more than 19 thousands. Because each file can contains only 255 ligatures so that's why they used this technique before the unicode. Now copy and paste the text from pdf file generated by this software, results a garbage in MS Word. The reason which I told above 89 font files. So there was no way to extract text from such kind of old pdf files. (Now a days this software has support of unicode but I am talking about old files). So I developed a software in C# to extract text from such old pdf files. The algorithm I am using, creating a database file which contains all the names of 89 font files with all the aschii codes, and in next column I typed Urdu unicode ligature in unicode. I process the pdf file character by character with font, matching the font name from my database file and get the unicode ligature from database and then display in a text box. So in this way I get the unicode text successfully. My software was working fine in many of pdf files. But few days ago I get complaint from a person that your software fails to extract text from this pdf. When I test, I found that the pdf file doesn't contain the original font names so that's why my software unable to do further process. When I checked the properties of this pdf file, It shows the PDF producer GPL Ghostscript 8.15. So I search the net and study the documentation related to fonts but still couldn't find any clue to decode and get the original font names.
Upvotes: 1
Views: 706
Reputation: 31207
The first thing you should do is try a more recent version of Ghostscript. 8.16 is 14 years old..... The current version is 9.21.
If that doe snot preserve the original names (potentially including the usual subset prefix) then we'll need to see an example input file which exhibits the problem.
It might also be helpful if you were to explain why you need to extract the font names, possibly you are attempting something which simply isn't possible.
[EDIT}
OK so now I understand the problem, I'm afraid the answer to your question is 'you can't get the original font name'.
The PDF file was created from the output of the (Adobe-created) Windows PostScript printer driver. When that embeds TrueType fonts into the PostScript stream as type 42 fonts, it gives them a pseudo-random name which is composed of 'TT' followed by some additional characters that may look like hex, but aren't.
Old versions of the Ghostscript pdfwrite device (and 8.15 is very old) simply used that name verbatim, and that's what has been used for the font names in the PDF file you supplied.
Newer versions are capable of digging further into the font and picking up the original font name which is present in the PostScript. Unfortunately the old versions didn't preserve that. Once you've thrown the information away there is no way to get it back again.
So if the only thing you have is this PDF file then its simply not possible to get the font names back. If the person who supplied you with the PDF file can remake it, using a more recent version of Ghostscript, then it will work. But I presume they don't have the PostScript program used to create a 14 year old file.
Upvotes: 2