Reputation: 976
I have a PDF file containing some tabular data.
http://dl.dropbox.com/u/44235928/sample_rotate-0.pdf
I have to extract the tabular data from it. I have tried following with no success :
The OCR solution is not very accurate though( about 80% words matched ).
I tried changing density and geometry of the image created from PDF to get better results from tesseract OCR.
convert -rotate 90 -geometry 10000 -depth 8 -density 800 sample.pdf img_800_10000.tif;
tesseract img_800_10000.tif img_800_10000.tif nobatch letters;
I am not sure for what kind of image( density, geometry, monochromatic, sharpen boundary etc) would be best suited for the OCR.
Please suggest what could be the best possible parameters(density,geometry,depth etc) for generating images from a PDF file, so that the tesseract accuracy will increase.
I am open to other( non-ocr ) solutions as well.
Upvotes: 4
Views: 3787
Reputation: 90295
Since a comment asked "How to define the geometry of an image when using Ghostscript as we do in convert
?", here is an answer:
It does not make sense to define geometry (that is image dimensions) and resolution for a raster image created by Ghostscript at the same time.
Once you convert a vector based page of a given dimension (such as PDF) into a raster image (such as the TIFF G4 format) giving a desired resolution (as done in the other answer), you already indirectly and implicitly also did set the dimension:
1008x612
points.-r72
in the Ghostscript command if given directly) the image dimensions will be 1008x612
pixels.-r720
in the Ghostscript command) the image dimensions will be 10080x6120
pixels.-r1440
in the Ghostscript command of my other answer) the image dimensions will be 20160x12240
pixels.-r1200
in the Ghostscript command) the image dimensions will be 16800x10200
pixels.-r1000
in the Ghostscript command) the image dimensions will be 14000x8500
pixels.-r120
in the Ghostscript command) the image dimensions will be 1680x1020
pixels.-r100
in the Ghostscript command) the image dimensions will be 1400x850
pixels.If you absolutely insist to specify the dimension/geometry for the output image on the Ghostscript commandline (rather than the resolution), you can do so by adding -gNNNNxMMMM -dPDFFitPage
to the commandline.
Upvotes: 1
Reputation: 90295
In this case I recommend to NOT use ImageMagick for the PDF -> TIFF conversion. Instead, use Ghostscript. Two reasons:
Using Ghostscript directly will give you more control over individual parameters of the conversion.
ImageMagick cannot do that particular conversion itself -- it will call Ghostscript as its 'delegate' anyway, but will not allow you to give all the same fine-grained control that your own Ghostscript command will give you.
Most of the text in the table of your sample PDF is extremely small (I guess, only 4 or 5 pt high). This makes it rather difficult to run a successful OCR unless you increase the resolution considerably.
Ghostscript uses -r72
by default for image format output (such as TIFF). Tesseract works best with r=300 or r=400 -- but only for a font size from 10-12 pt or higher. Therefor, to compensate for the small text size you should make Ghostscript using a resolution of at least 1200 DPI when it renders the PDF to the image.
Also, you'll have to rotate the image so the text displays in the normal reading direction (not bottom -> top).
This is the command which I would try first:
gs \
-o sample.tif \
-sDEVICE=tiffg4 \
-r1200 \
-dAutoRotatePages=/PageByPage \
sample_rotate-0.pdf
You may need to play with variations of the -r1200
parameter (higher or lower) for best results.
Upvotes: 4
Reputation: 3892
There you can find decoded content of your file: https://docs.google.com/open?id=0B1YEM-11PerqSHpnb1RQcnJ4cFk
A absolutely sure the OCR is the best way to read pdf file, but you can try REGEX-ing the native content. It going to be be the hard and long way.
Upvotes: 0