Flash Thunder

Reputation: 12036

tesseract ocr pdf - segmentation fault

I am trying to OCR a PDF file with Tesseract, but it fails with:

Tesseract Open Source OCR Engine
name_to_image_type:Error:Unrecognized image type:upload526.pdf
IMAGE::read_header:Error:Can't read this image type:upload526.pdf
tesseract:Error:Read of file failed:upload526.pdf
Segmentation fault

I need this to build a searchable database of PDFs that were scanned manually (so each page is an image). What am I doing wrong? I read that Tesseract supports PDFs. I don't know which version I have, because neither tesseract --version nor tesseract -v works at all.

Upvotes: 1

Views: 1163

Answers (2)

Reuben L.

Reputation: 2859

You could try something along these lines, using ImageMagick to convert the PDF to a TIFF before running Tesseract on it:

convert -density 300 file.pdf -depth 8 file.tiff  
tesseract file.tiff output

Upvotes: 1

nguyenq

Reputation: 8345

Tesseract does not read PDF files. You'll need to convert the PDF to an image format (TIFF, PNG) first, for example with Ghostscript, ImageMagick, or your own code.
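
As a rough illustration of the "your own code" route, here is a minimal Python sketch that renders each page with Ghostscript and then runs Tesseract on the resulting PNGs. It assumes the gs and tesseract executables are on your PATH. The function names, the page-%03d.png naming pattern, and the 300 dpi default are choices made for this example, not anything Tesseract requires.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_pattern, dpi=300):
    """Build the Ghostscript argument list that renders each PDF page to a PNG.

    out_pattern should contain a %d-style placeholder (e.g. page-%03d.png)
    so that every page gets its own file.
    """
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER", "-q",
        "-sDEVICE=png16m",           # 24-bit colour PNG output device
        f"-r{dpi}",                  # render resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def tesseract_command(image_path, output_base):
    """Build the Tesseract argument list; Tesseract writes <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path, work_dir, dpi=300):
    """Render a PDF to page images, OCR each page, and return the joined text.

    Assumes gs and tesseract are installed; raises CalledProcessError
    if either tool exits with a non-zero status.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    pattern = work_dir / "page-%03d.png"   # hypothetical naming scheme
    subprocess.run(ghostscript_command(pdf_path, pattern, dpi), check=True)

    texts = []
    for image in sorted(work_dir.glob("page-*.png")):
        base = image.with_suffix("")       # page-001.png -> page-001
        subprocess.run(tesseract_command(image, base), check=True)
        texts.append(Path(f"{base}.txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling ocr_pdf("upload526.pdf", "pages") would then return the recognized text of all pages as one string, which you could store in your search database. The directory name "pages" is only a placeholder.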

Upvotes: 1
