KleberBH

Reputation: 462

Using Ghostscript to convert PDF to text while keeping the PDF table format

I have this command that converts a PDF to a text file:

gswin32c -dBATCH -dNOPAUSE -dSAFER -dDELAYBIND -dWRITESYSTEMDICT 
-dSIMPLE -sDEVICE=txtwrite -dTextFormat=2 -dFirstPage=1 -dLastPage=1 
-sOutputFile=C:\out.txt C:\in.pdf

It works almost fine; the only problem is that it does not keep the PDF table formatting.

Example:

In the PDF file:

Type    From        Name             Name2                   Code         Week
Regular 30/03/15    KNOWLES, BEN     HOOT KNOWLES, ANGELA    367-739-746  80.00       
Regular 30/03/15    RICHARDS, COLE   ROBERT HARRIS, BRADIE   401-844-307  108.00      
Regular 30/03/15    SKEELS, MATT     BISHOP, JASON GREGSON   413-980-291  112.00

After converting it to a text file, the columns run together like this:

Type From Name Name2 Code Week
Regular30/03/15KNOWLES, BENHOOT KNOWLES, ANGELA367-739-74680.00       
Regular30/03/15RICHARDS, COLEROBERT HARRIS, BRADIE401-844-307108.00      
Regular30/03/15SKEELS, MATTBISHOP, JASON GREGSON413-980-291112.00

I need the output to keep the table formatting. Any idea how to do that?

I am using Ghostscript (gswin32c) version 9.16 on a Windows 7 machine.

Also, I am open to suggestions for other ways to achieve it.

Cheers

Upvotes: 1

Views: 3610

Answers (2)

Treviño

Reputation: 3528

pdftotext from poppler-utils, run with the -layout option, works acceptably well for this.
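As a rough sketch of how that could be scripted, the Python snippet below builds and runs a pdftotext command line with -layout, limited to page 1 like the Ghostscript command in the question. It assumes pdftotext is installed and on the PATH; the function names pdftotext_layout_cmd and convert are made up for this illustration.

```python
import subprocess


def pdftotext_layout_cmd(pdf_path, txt_path, first_page=1, last_page=1):
    """Build a pdftotext command line that tries to keep the page layout."""
    return [
        "pdftotext",
        "-layout",               # keep the original physical layout as far as possible
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        pdf_path,
        txt_path,
    ]


def convert(pdf_path, txt_path, first_page=1, last_page=1):
    """Run pdftotext; raises CalledProcessError on a non-zero exit status."""
    cmd = pdftotext_layout_cmd(pdf_path, txt_path, first_page, last_page)
    subprocess.run(cmd, check=True)
```

Calling convert(r"C:\in.pdf", r"C:\out.txt") would then write the first page to C:\out.txt. How well the columns line up still depends on the PDF itself.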

Upvotes: 0

KenS

Reputation: 31141

There isn't a 'table format' in PDF, just a sequence of text and positions. One of the possible output formats for txtwrite attempts to make a Unicode text file in which the spacing is re-created with space characters. Note that this assumes a fixed-pitch font, so it won't work well if the document doesn't use one.
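To illustrate the general idea (this is not Ghostscript's actual algorithm, only a toy model of it), the sketch below places text fragments on a character grid by dividing each x position by an assumed character width. The function name render_fixed_pitch, the coordinates and the widths are all invented for the example.

```python
def render_fixed_pitch(fragments, char_width):
    """Place (x, y, text) fragments on a character grid.

    x and y are page coordinates in points; char_width is the assumed
    width of one character cell in points (the fixed-pitch assumption).
    """
    rows = {}
    for x, y, text in fragments:
        rows.setdefault(y, []).append((x, text))

    lines = []
    for y in sorted(rows):  # assumes y grows downward on the page
        line = ""
        for x, text in sorted(rows[y]):
            col = round(x / char_width)
            # Pad with spaces up to the target column. If the text already
            # written reaches past that column, no space is inserted at all.
            line = line.ljust(col) + text
        lines.append(line)
    return "\n".join(lines)


fragments = [(0, 100, "Regular"), (48, 100, "30/03/15"), (120, 100, "KNOWLES, BEN")]
print(render_fixed_pitch(fragments, 6))   # cell width matches the text
print(render_fixed_pitch(fragments, 12))  # cell width too large for the text
```

With a cell width of 6 the columns stay apart; with 12 the computed columns fall inside text that has already been written, so the fragments run together. That is one way a wrong width assumption can collapse columns in a scheme like this; whether it is what happens with the file in the question cannot be told without the PDF.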

Without seeing the input PDF file, it's not really possible to guess why this isn't producing the output you expect.

You can tackle this problem yourself in two ways. First, there are other output formats; one of them is an XML-like format that emits the text sequences and their positions, so you could use that and recreate the layout yourself (or even just archive it directly). Alternatively, since Ghostscript is open source, you could read and debug the source to figure out why your PDF file is causing a problem.
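A minimal sketch of the first route, under some assumptions: the XML-like output is selected by running the command from the question with -dTextFormat=0 instead of -dTextFormat=2, and each character appears as an element of the form <char bbox="x0 y0 x1 y1" c="R"/>. Check both against the output of your own Ghostscript version before relying on this. The function name rows_from_xml and the gap threshold are invented for the example.

```python
import re
from collections import defaultdict
from html import unescape

# Assumed shape of one character element in the XML-like output.
CHAR_RE = re.compile(
    r'<char bbox="(-?[\d.]+) (-?[\d.]+) (-?[\d.]+) (-?[\d.]+)" c="([^"]*)"\s*/>'
)


def rows_from_xml(xml_text, gap=10.0):
    """Group characters into rows by y, then split rows into cells at wide gaps.

    gap is the horizontal distance (in the units of the bbox values) above
    which two neighbouring characters are treated as separate columns.
    """
    rows = defaultdict(list)
    for x0, y0, x1, _y1, c in CHAR_RE.findall(xml_text):
        rows[float(y0)].append((float(x0), float(x1), unescape(c)))

    table = []
    for y in sorted(rows):
        cells, current, prev_x1 = [], "", None
        for x0, x1, ch in sorted(rows[y]):
            if prev_x1 is not None and x0 - prev_x1 > gap:
                cells.append(current.strip())  # wide gap: start a new cell
                current = ""
            current += ch
            prev_x1 = x1
        cells.append(current.strip())
        table.append(cells)
    return table
```

Each row comes back as a list of cell strings, which can then be written out as CSV or padded to fixed widths. The gap value needs tuning per document: if it is too small, ordinary word spacing splits a cell, and if it is too large, neighbouring columns merge.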

Upvotes: 0
