KitKat

Reputation: 181

Extracting text from PDF with Ghostscript

I am using Ghostscript 9.20 to extract the text from a PDF document that contains only two lines of text:

Hello world…
A beautiful day!

The command I ran is:

gswin32c -sDEVICE=txtwrite -o output.txt input.pdf

However, the output is:

  䠀攀氀氀漀 眀漀爀氀搀☠ 
  䄀 戀攀愀甀琀椀昀甀氀 搀愀礀℀ 

What is going on and how do I fix it?

Upvotes: 8

Views: 10917

Answers (1)

KenS

Reputation: 31141

There was a bug in the 9.20 release that affected certain kinds of text extraction. Not every file is affected; it depends on the input, and since you haven't supplied yours, it's impossible to tell whether your particular input file is one of them.

To fix it you can:

  1. Clone Ghostscript from our Git repository, build and test the latest code.
  2. Wait until the next release (March) and test that.
  3. Open a bug report and someone will look at it, though that won't help you directly. If it's already been fixed, you'll still have to choose either 1 or 2. If it hasn't been fixed, you'll need to wait until it is and then follow either 1 or 2, but at least you'll have helped improve the product.
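
In the meantime, the output shown in the question looks like UTF-16 text read with the wrong byte order: the first character is U+4800 where U+0048 ("H") would be expected, and the trailing U+2620 corresponds to U+2026 ("…"). Assuming that is all that has gone wrong, a minimal Python sketch like the one below may recover the text from output that has already been written. The function name `swap_utf16_bytes` is hypothetical, and this is only a diagnostic for this symptom, not a confirmed description of the 9.20 bug.

```python
def swap_utf16_bytes(text: str) -> str:
    """Reinterpret each UTF-16 code unit of `text` with its two bytes swapped.

    Hypothetical helper: assumes the garbling is purely a byte-order mix-up.
    """
    # Serialise as little-endian UTF-16, then read the same bytes as big-endian.
    return text.encode("utf-16-le").decode("utf-16-be")


# First line of the garbled output, written as escapes so that the
# swapped space (U+2000) is unambiguous.
garbled = (
    "\u4800\u6500\u6c00\u6c00\u6f00\u2000"
    "\u7700\u6f00\u7200\u6c00\u6400\u2620"
)

print(swap_utf16_bytes(garbled))
```

If the decoded text matches the PDF, the extraction itself worked and only the byte order of the written output is off; if it does not, the file is probably hitting a different problem and a bug report with the input attached is the better route.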

Upvotes: 4
