Farzin

Reputation: 11

Converting PostScript to Text Using Ghostscript

I want to extract text from PostScript documents. The problem is that when I use Ghostscript to do that, some text is extracted normally while other text is converted to weird symbolic characters.

I realized that the text which was extracted normally was set in fonts that Ghostscript would NOT embed in the PDF because of licensing restrictions. Ironically, the text in fonts without licensing restrictions, which were embedded in the PDF as usual, was not converted back correctly.

I tried both the txtwrite device, to convert the PostScript directly to text, and the pdfwrite device, to first convert the PS to PDF and then extract the text from the PDF document, but neither of them worked.
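For reference, the two attempts can be sketched roughly as below. This is only an illustration: the file names are placeholders, the executable is assumed to be `gs` (on Windows it is typically `gswin64c`), and using txtwrite again for the PDF-to-text step is an assumption, since any PDF text extractor could be used there.

```python
import os
import shutil
import subprocess

# Common Ghostscript switches: quiet start-up, no pause between pages, exit when done.
COMMON = ["-q", "-dNOPAUSE", "-dBATCH"]


def gs_command(device, src, dst, gs="gs"):
    """Build a Ghostscript command line that converts src to dst with the given device."""
    return [gs, *COMMON, f"-sDEVICE={device}", f"-sOutputFile={dst}", src]


def ps_to_text_direct(ps_path, txt_path):
    """Attempt 1: PostScript straight to text with the txtwrite device."""
    return [gs_command("txtwrite", ps_path, txt_path)]


def ps_to_text_via_pdf(ps_path, pdf_path, txt_path):
    """Attempt 2: PostScript to PDF with pdfwrite, then PDF to text.

    Using txtwrite for the second step is an assumption made for this sketch.
    """
    return [
        gs_command("pdfwrite", ps_path, pdf_path),
        gs_command("txtwrite", pdf_path, txt_path),
    ]


def run_all(commands):
    """Run each command in order; raises CalledProcessError if Ghostscript fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # input.ps, output.pdf and the .txt names are placeholder paths.
    if shutil.which("gs") and os.path.exists("input.ps"):
        run_all(ps_to_text_direct("input.ps", "direct.txt"))
        run_all(ps_to_text_via_pdf("input.ps", "output.pdf", "via_pdf.txt"))
```

Both routes produce the same garbled characters for the affected fonts.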

I thought I might be able to substitute all fonts with the fonts that Ghostscript does not embed, so that the text would be extracted correctly, but it turned out there is no simple way to do that.

What do you think I should do?

Upvotes: 1

Views: 1291

Answers (1)

Thomas W

Reputation: 15371

The cause of this is usually that the characters are encoded in a non-standard fashion. I'm afraid there is not a lot you can do, except possibly comparing the readable PostScript with the extracted text to find out which "weird symbolic character" corresponds to which actual character. Then you might be able to reconstruct the original text by replacing the weird characters with the intended ones.
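A minimal sketch of that replacement idea in Python, assuming each garbled character stands for exactly one intended character. The function names `build_mapping` and `restore_text` are made up for this illustration, and the sample strings are invented. Because the encoding can differ from font to font, a separate table per font may be needed in practice.

```python
def build_mapping(garbled_sample, expected_sample):
    """Derive a garbled -> intended character table from two aligned samples.

    Both samples must cover the same stretch of text, character for character.
    """
    if len(garbled_sample) != len(expected_sample):
        raise ValueError("samples must have the same length")
    mapping = {}
    for bad, good in zip(garbled_sample, expected_sample):
        # setdefault stores the first pairing; a later conflicting pairing is an error.
        if mapping.setdefault(bad, good) != good:
            raise ValueError(f"{bad!r} maps to both {mapping[bad]!r} and {good!r}")
    return mapping


def restore_text(garbled, mapping):
    """Replace every mapped character; characters not in the table are left as they are."""
    return garbled.translate(str.maketrans(mapping))


if __name__ == "__main__":
    # Invented example: control characters standing in for the letters a, b, c.
    table = build_mapping("\u0001\u0002\u0003", "abc")
    print(restore_text("\u0003\u0001\u0002", table))
```

The table only covers characters that appear in the sample you compared by hand, so a longer sample gives a more complete result.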

Upvotes: 1
