RagHaven

Reputation: 4327

Tesseract returns non-English characters

I recently followed some tutorials to set up Tesseract, and now I am trying to check whether the OCR is working properly. When I take a picture and extract the text, I sometimes get non-English characters. It actually looks like gibberish. Here is an example of the output I got:

 ; .'—--~_~:~ ear
 .::§—‘.::~__>‘Z~r'.‘ ,::-SES‘:3£a"3'§_“5.E.~ °?®.=_-
 .—_;%~‘=*c§u-5; H =—oc+-»o cn-5 '55:.

The picture I took was of the first page of the research article in this link. I'm not sure why this is happening. I do have the eng.traineddata file in the tessdata subdirectory.

Upvotes: 1

Views: 524

Answers (1)

sschrass

Reputation: 7156

Two things come to mind:

  • train Tesseract for the font that is used in the image
  • preprocess the image before running OCR
    • grayscale
    • resize
    • dilate
    • smoothing
    • gaussian blur
    • ... and so on

For the editing I can recommend ImageMagick.
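
To make a few of the steps listed above concrete, here is a minimal pure-Python sketch of grayscale conversion, resizing and dilation on a tiny in-memory image. It is only an illustration of what those operations do: the list-of-rows image representation and the function names (`to_grayscale`, `resize_nearest`, `dilate`, `preprocess`) are my own assumptions, and in practice you would let ImageMagick or an imaging library do this on the real photo.

```python
# Illustrative sketch only: images are plain lists of rows, not real image files.

def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luma values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def resize_nearest(gray, factor):
    """Upscale by an integer factor using nearest-neighbour sampling."""
    out = []
    for row in gray:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def dilate(gray):
    """Replace each pixel with the maximum of its 3x3 neighbourhood.

    This grows bright regions. For dark text on a light background it
    thins the strokes, so the opposite operation (taking the minimum)
    may be the one you want.
    """
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            row.append(max(
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ))
        out.append(row)
    return out


def preprocess(rgb, factor=2):
    """Chain the three steps: grayscale, upscale, dilate."""
    return dilate(resize_nearest(to_grayscale(rgb), factor))


if __name__ == "__main__":
    # 2x2 test image: one white pixel, three black pixels.
    image = [[(255, 255, 255), (0, 0, 0)],
             [(0, 0, 0), (0, 0, 0)]]
    for row in preprocess(image):
        print(row)
```

Which steps actually help depends on the photo, so it is worth trying them one at a time and comparing the OCR output after each change.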

Upvotes: 1
