Reputation: 12036
I am trying to OCR a PDF file with tesseract, but it says:
Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:upload526.pdf
IMAGE::read_header:Error:Can't read this image type:upload526.pdf
tesseract:Error:Read of file failed:upload526.pdf
Segmentation fault
I need it to build a searchable database of PDFs that were scanned manually (to images). What am I doing wrong? I read that it supports PDFs. I have no idea which version it is, as neither tesseract --version nor tesseract -v works at all.
Upvotes: 1
Views: 1163
Reputation: 2859
You could try something along the lines of this (ImageMagick library):
convert -density 300 file.pdf -depth 8 file.tiff
tesseract file.tiff output
Upvotes: 1
Reputation: 8345
Tesseract does not read PDF. You'll need to convert it to an image format (TIFF, PNG) first. Try Ghostscript or ImageMagick, or do the conversion programmatically.
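As a rough sketch of the Ghostscript route, something like the following might work. The function names pdf_to_text and output_base are made up for illustration, and the tiffgray device and 300 dpi resolution are assumptions you may want to change for your scans. It renders each PDF page to its own TIFF and then runs tesseract on every page image.

```shell
#!/bin/sh
# Hypothetical helper: strip the extension from a page image path to get
# the output base name that tesseract will append ".txt" to.
output_base() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: render every page of a PDF to a grayscale TIFF
# with Ghostscript, then OCR each page image with tesseract.
pdf_to_text() {
    pdf=$1
    prefix=${pdf%.pdf}

    # One TIFF per page: prefix-001.tiff, prefix-002.tiff, ...
    # (tiffgray and -r300 are assumed settings, not requirements)
    gs -q -dNOPAUSE -dBATCH -sDEVICE=tiffgray -r300 \
       -sOutputFile="${prefix}-%03d.tiff" "$pdf" || return 1

    for page in "${prefix}"-*.tiff; do
        # Skip the literal pattern if no pages were produced
        [ -e "$page" ] || continue
        tesseract "$page" "$(output_base "$page")"
    done
}
```

Calling pdf_to_text upload526.pdf should leave one .txt file per page next to the PDF, assuming gs and tesseract are both on your PATH.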
Upvotes: 1