Reputation: 462
I have this command that converts a PDF to a text file:
gswin32c -dBATCH -dNOPAUSE -dSAFER -dDELAYBIND -dWRITESYSTEMDICT
-dSIMPLE -sDEVICE=txtwrite -dTextFormat=2 -dFirstPage=1 -dLastPage=1
-sOutputFile=C:\out.txt C:\in.pdf
It works almost fine; the only problem is that it does not keep the PDF table formatting.
Example:
In the PDF file:
Type From Name Name2 Code Week
Regular 30/03/15 KNOWLES, BEN HOOT KNOWLES, ANGELA 367-739-746 80.00
Regular 30/03/15 RICHARDS, COLE ROBERT HARRIS, BRADIE 401-844-307 108.00
Regular 30/03/15 SKEELS, MATT BISHOP, JASON GREGSON 413-980-291 112.00
After converting it to a text file, the columns run together like this:
Type From Name Name2 Code Week
Regular30/03/15KNOWLES, BENHOOT KNOWLES, ANGELA367-739-74680.00
Regular30/03/15RICHARDS, COLEROBERT HARRIS, BRADIE401-844-307108.00
Regular30/03/15SKEELS, MATTBISHOP, JASON GREGSON413-980-291112.00
I need it to keep its formatting. Any idea how to keep the formatting?
I am using Ghostscript (gswin32c) version 9.16 on a Windows 7 machine.
Also, I am open to suggestions for other ways to achieve this.
Cheers
Upvotes: 1
Views: 3610
Reputation: 3528
pdftotext from poppler-utils with the -layout option works acceptably well for this.
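If you want to script this, here is a minimal Python sketch that builds the pdftotext call for a page range. It assumes pdftotext is installed and on your PATH; the helper names (pdftotext_layout_cmd, convert) are made up for illustration, and the paths are the ones from the question.

```python
import subprocess


def pdftotext_layout_cmd(pdf_path, txt_path, first_page=1, last_page=1):
    """Build the argument list for pdftotext with layout preservation.

    -layout keeps the physical layout of the page as far as possible,
    -f and -l select the first and last page to convert.
    """
    return [
        "pdftotext",
        "-layout",
        "-f", str(first_page),
        "-l", str(last_page),
        pdf_path,
        txt_path,
    ]


def convert(pdf_path, txt_path, first_page=1, last_page=1):
    """Run pdftotext; raises CalledProcessError if it exits with an error.

    Assumption: the pdftotext executable can be found on PATH.
    """
    cmd = pdftotext_layout_cmd(pdf_path, txt_path, first_page, last_page)
    subprocess.run(cmd, check=True)
```

Calling convert(r"C:\in.pdf", r"C:\out.txt") should then do the same job as the Ghostscript command in the question, for page 1 only.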
Upvotes: 0
Reputation: 31141
There isn't a 'table format' in PDF, just a sequence of text and positions. One of the possible output formats for txtwrite attempts to make a Unicode text file, where the spacing is re-created by space characters. Note that this assumes a fixed-pitch font, so it won't work well if you don't use one.
Without seeing the input PDF file it's not really possible to make any guesses as to why this isn't producing the output you expect.
You can tackle this problem yourself. First, there are other potential output formats; one of them is an XML-like format that emits the text sequences and their positions, so you could use that and recreate the format yourself (or even just archive it directly). Alternatively, since Ghostscript is open source, you could read and debug the source yourself and figure out why your PDF file is causing a problem.
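As a rough idea of what "recreate the format yourself" could look like, here is a small Python sketch. It is not how txtwrite does it internally, and it does not parse the XML-like output; it assumes you have already extracted the text fragments into (x, y, text) tuples in points, with y growing upward as in PDF. The function name render_layout and the char_width and line_height defaults are hypothetical values you would tune for your document.

```python
from collections import defaultdict


def render_layout(fragments, char_width=6.0, line_height=12.0):
    """Place positioned text fragments on a fixed-pitch character grid.

    fragments: iterable of (x, y, text), coordinates in points,
    with y increasing up the page (assumed, PDF-style).
    """
    # Group fragments into text rows by their vertical position.
    rows = defaultdict(list)
    for x, y, text in fragments:
        rows[round(y / line_height)].append((x, text))

    lines = []
    # Larger y is higher on the page, so emit those rows first.
    for key in sorted(rows, reverse=True):
        line = ""
        for x, text in sorted(rows[key]):
            col = round(x / char_width)
            # Never let two fragments touch: keep at least one space.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + text
        lines.append(line)
    return "\n".join(lines)
```

The forced single space between fragments is the part that would stop columns from running together as in the question. With a proportional font the column estimate from char_width will drift, which is the fixed-pitch limitation mentioned above.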
Upvotes: 0