marek.kapowicki

Reputation: 732

Tesseract OCR ignores lines that contain asterisks

I have an image created from an old fax document (the font is unusual). Tesseract generally works pretty well with this input, except in one case: when a line starts with many leading asterisks ('*'), it is ignored.

The OCR result for the fax with '*' differs depending on the psm:

  1. psm 1: an empty page
  2. psm 6: For queries please contact NA, KRKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK KKK KK KK KK

In every case the title "comment" is skipped.

But when I manually removed all the '*' characters from the image in Paint, the OCR worked fine on the image without '*'. I have no idea how to get the same result without image preprocessing. Does anyone understand what is going on?

Upvotes: 0

Views: 268

Answers (1)

user898678

Reputation: 3328

Try this: tesseract 9UIKs.png - --psm 4 --oem 0

Which produces:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxkkkk COMMENT kkkkxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

For queries p1ease contact NA.

XXXXXKKKKXXXVXKKKKKXXXXXKXXXXXXXXXXXXKXXXXXXXXXXXXXXXXKKXXXXXXXXXXXXXXXXXKXXXX.

Because --oem 0 selects the legacy engine, you will need a language model that supports it (available from https://github.com/tesseract-ocr/tessdata).
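If you want to call this from a script, the command above could be wrapped roughly as follows. This is only a sketch: the helper names build_tesseract_cmd and run_ocr are made up for illustration, and the tessdata_dir argument assumes you have downloaded the legacy-capable models from the repository above into a local folder of your choosing. The flags themselves (--psm, --oem, --tessdata-dir, and '-' for stdout) are the regular tesseract command-line options.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_cmd(image_path: str, psm: int = 4, oem: int = 0,
                        tessdata_dir: Optional[str] = None) -> List[str]:
    """Build the tesseract command line; '-' sends the recognized text to stdout."""
    cmd = ["tesseract", image_path, "-", "--psm", str(psm), "--oem", str(oem)]
    if tessdata_dir is not None:
        # Assumption: tessdata_dir is a local folder holding traineddata files
        # that include the legacy engine (e.g. from tesseract-ocr/tessdata).
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path: str, psm: int = 4, oem: int = 0,
            tessdata_dir: Optional[str] = None) -> str:
    """Run tesseract and return its stdout as text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm, oem, tessdata_dir),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Show the command that would be executed for the image from the question.
    print(" ".join(build_tesseract_cmd("9UIKs.png")))
```

Calling run_ocr("9UIKs.png") with different psm values is a quick way to compare page segmentation modes on the same fax image; if the installed traineddata lacks the legacy engine, tesseract will report an error and check=True will raise it as an exception.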

Upvotes: 1
