OCR - how to get text from outlined words

Question

I have an image of text, where the words are outlined rather than filled in. Tesseract is struggling to get any of the words correct - does anyone have a solution to these types of problems?

I have tried simple operations like inversion, but to no effect. I'm guessing Tesseract already handles this.

Img example:
Typical output for Next: New
Typical output for Previous: Pﬂevuows

My (very simple) code, which takes the image path as an argument:

import pytesseract
import sys
from PIL import Image

print(pytesseract.image_to_string(Image.open(sys.argv[1])))
print(sys.argv[1])

EDIT: Applying a binary threshold gets me "Next", but it still does not get "Previous".

raghav m · Accepted Answer

This is probably too late for you, but it may help anyone else who finds this. I had the same problem and fixed it using OpenCV.

First, apply a binary threshold. With the right threshold value, your letters shouldn't touch, and this should work well. The point of thresholding is that the flood fill in the next step can succeed instead of getting stuck on faded gray pixels (which seems to be what happened when you tried it before).

After this, flood fill the background with black. Since your letters don't touch the image borders, a fill started from the border should cover the whole background and leave only the letter interiors white, although when I did it I had to call floodFill on every outermost pixel of the image.

Lastly, invert the image colors, which can be done with cv2.bitwise_not(). The letters are then solid dark shapes on a light background, and the image should be ready for OCR.
