sprogissd

Reputation: 3075

Why can't Pytesseract recognize plain white text on black?

I have a lot of images like the one below, and I need to use pytesseract to grab the white text from them:

enter image description here

I use the following code, but the results are not impressive:

import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
im = Image.open('topLine.png')
print(pytesseract.image_to_string(im))

Results:

Rouse Services | Renta Dastbonrd | Blei Rental



RJ |G | B (mmm @

So I thought the reason was non-text inside the image. I cropped the part of the image with the most important text to me and ran the same code against it:

enter image description here

However, all I got was blank. Pytesseract didn't find any text at all. What am I doing wrong?

Upvotes: 1

Views: 2362

Answers (1)

ben shapiro

Reputation: 121

To answer your original question: I believe their training dataset contains only black text on a white background, so it's not surprising that the machine learning algorithm won't pick up the inverse.

Now for the solution. If the black box with white text is in the same spot in every image, I would just crop it out, invert it, then paste it back in the same spot. Otherwise, you can use erode/dilate operations with a customized kernel to find these black boxes and build a mask over that part of the image. Using this mask you can tell Python, in effect, "here is a black box with white text, invert it."

In my experience, pytesseract has always needed at least some image processing (if not a lot) to get good output, but even with the most messed-up images I have been able to get accuracies above 93%.
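For the fixed-position case, a minimal sketch with Pillow could look like the following. The helper names `invert_region` and `ocr_with_inverted_region` are made up for illustration, and the box coordinates have to be measured from your own images. I have not run this against the images in the question, so treat it as a starting point rather than a guaranteed fix.

```python
from PIL import Image, ImageOps


def invert_region(image, box):
    """Return a copy of `image` with the pixels inside `box` inverted.

    `box` is (left, upper, right, lower), the same form Image.crop takes.
    The name and signature are hypothetical, not part of any library.
    """
    # Work on an RGB copy; ImageOps.invert does not accept every mode (e.g. RGBA).
    result = image.convert("RGB")
    region = ImageOps.invert(result.crop(box))
    # Paste the inverted patch back where it came from.
    result.paste(region, box)
    return result


def ocr_with_inverted_region(path, box):
    """Invert one region of the image at `path`, then run pytesseract on it."""
    # Imported here so invert_region can be used without Tesseract installed.
    import pytesseract

    with Image.open(path) as im:
        return pytesseract.image_to_string(invert_region(im, box))
```

You would call it as something like `ocr_with_inverted_region('topLine.png', (0, 0, 400, 40))`, where the box values are placeholders. If the whole image is white on black, passing the full image bounds inverts everything.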

Upvotes: 2
