Lemon Tree

Reputation: 63

How to read digits from an image using pytesseract

I'm trying to read the digits from this image:

[Image: num]

Using pytesseract with these settings:

custom_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(img, config=custom_config)

This is the output:

((E ST7 [71aT6T2 ] THETOGOG5 15 [8)

Upvotes: 2

Views: 2979

Answers (1)

Andrew James

Reputation: 402

Whitelisting only digits and changing the page segmentation mode (--psm) gives much better results. You also need to strip newlines and other whitespace from the output. The code below does that.

import pytesseract
import re
from PIL import Image

#Open image
im = Image.open("numbers.png")

#Define configuration that only whitelists number characters
custom_config = r'--oem 3 --psm 11 -c tessedit_char_whitelist=0123456789'

#Find the numbers in the image
numbers_string = pytesseract.image_to_string(im, config=custom_config)

#Remove every character that is not a digit (newlines, spaces, stray symbols)
numbers_int = re.sub(r'\D', '', numbers_string)

#print the output
print(numbers_int)

The result of the code on your image is: '31477423353'

Unfortunately, a few digits are still missing. As an experiment, I downloaded your image and erased the grid.

[Image: the same picture with the grid erased]

After removing the grid and executing the code again, pytesseract produces a perfect result: '314774628300558'

So consider removing the grid programmatically. There are alternatives to pytesseract, but whichever tool you use, you will get better output when the digits are isolated in the image.
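As a rough sketch of what programmatic grid removal could look like (not what I used above, and not tested on your exact image): a grid line spans most of the picture, so a pixel row or column that is mostly dark is probably a line and can be painted white. The function name erase_grid_lines and both thresholds are my own assumptions, and it assumes dark ink on a light background with a grid that is not rotated.

```python
import numpy as np

def erase_grid_lines(gray, dark_threshold=128, line_fraction=0.5):
    """Paint mostly-dark rows and columns (likely grid lines) white.

    gray: 2-D uint8 array, dark ink on a light background.
    dark_threshold and line_fraction are guesses; tune them per image.
    """
    dark = gray < dark_threshold
    # A grid line covers most of its row or column, while a row that
    # only crosses a few digits has a much smaller share of dark pixels.
    line_rows = dark.mean(axis=1) > line_fraction
    line_cols = dark.mean(axis=0) > line_fraction
    cleaned = gray.copy()
    cleaned[line_rows, :] = 255
    cleaned[:, line_cols] = 255
    return cleaned
```

To try it, convert the image with np.array(Image.open("numbers.png").convert("L")), pass the array to erase_grid_lines, and hand Image.fromarray(result) to pytesseract.image_to_string with the same config as above. Note that this also wipes the pixels where a digit touches a line, so digits that overlap the grid may come out damaged.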

Upvotes: 5
