CuriousGeorge

Reputation: 311

pytesseract using tesseract 4.0 numbers only not working

Has anyone managed to get digits-only output when calling the latest version of Tesseract (4.0) from Python?

The code below worked in 3.05 but still returns letters in 4.0. I tried removing every config file except the digits file, and it still didn't work. Any help would be great.

im is an image of a date, black text on a white background:

import pytesseract
im = imageOfDate  # placeholder for the loaded image of the date
text = pytesseract.image_to_string(im, config='outputbase digits')
print(text)

Upvotes: 18

Views: 53156

Answers (4)

Tejesh Teju

Reputation: 117

You can restrict the output to digits by passing tessedit_char_whitelist as a config option, as shown below.

ocr_result = pytesseract.image_to_string(image, lang='eng',config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
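
A minimal sketch of the same idea with the config string built by a small helper. The names build_digits_config and ocr_digits are hypothetical, not part of pytesseract. Note that --psm 10 tells Tesseract to treat the image as a single character, so for a whole date a single-line mode such as --psm 7 may be a better fit; that default is an assumption on my part, not something the answer above tested.

```python
def build_digits_config(whitelist="0123456789", psm=7, oem=3):
    """Return a Tesseract config string that restricts output to `whitelist`."""
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


def ocr_digits(image, **kwargs):
    """Run OCR on `image` with the whitelist config (hypothetical wrapper)."""
    # Imported lazily so build_digits_config can be used without pytesseract.
    import pytesseract
    return pytesseract.image_to_string(
        image, lang="eng", config=build_digits_config(**kwargs)
    )


# Example: also allow the slash so a date such as 12/03/2019 keeps its separators.
config = build_digits_config(whitelist="0123456789/")
print(config)
```

Whether the whitelist is honored still depends on the Tesseract version and engine, as the other answers in this thread point out.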

Upvotes: 5

mhellmeier

Reputation: 2291

As you can see in this GitHub issue, the blacklist and whitelist don't work with Tesseract version 4.0.

There are three possible solutions to this problem, as I described in this blog article:

  1. Update Tesseract to version 4.1 or later
  2. Use the legacy mode as described in the answer from @thewaywewere
  3. Create a Python function which uses a simple regex to extract all numbers from the OCR output:

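    import re
    import pytesseract

    def replace_chars(text):
        # Collect every run of digits and join them into one string
        list_of_numbers = re.findall(r'\d+', text)
        result_number = ''.join(list_of_numbers)
        return result_number

    # Run OCR without restrictions, then keep only the digits
    result_number = replace_chars(pytesseract.image_to_string(im))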
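
Since the question is about an image of a date, here is a hedged variation on the regex approach that can also keep separator characters. The helper name keep_digits and its extra parameter are made up for illustration, and the sample string stands in for whatever image_to_string actually returns.

```python
import re


def keep_digits(text, extra=""):
    """Drop every character that is not a digit or listed in `extra`."""
    # re.escape keeps characters such as '-' or ']' literal inside the class.
    allowed = "0-9" + re.escape(extra)
    return re.sub(f"[^{allowed}]", "", text)


raw = "Date: 12/03/2019\n"  # stand-in for unrestricted OCR output
print(keep_digits(raw))             # 12032019
print(keep_digits(raw, extra="/"))  # 12/03/2019
```

This only filters the text after recognition; it does not stop Tesseract from misreading a digit as a letter in the first place.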
    

Upvotes: 2

Robert Harris

Reputation: 249

Using the tessedit_char_whitelist flag with pytesseract did not work for me. However, one workaround is to use a flag that does work, namely config='digits':

import pytesseract
text = pytesseract.image_to_string(pixels, config='digits')

Here pixels is a NumPy array of your image (a PIL image should also work). This should make pytesseract return only digits. To customize what it returns, find your digits configuration file; on Windows mine was located here:

C:\Program Files (x86)\Tesseract-OCR\tessdata\configs

Open the digits file and add whatever characters you want. After saving and running pytesseract, it should return only those customized characters.
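
To illustrate what editing that file amounts to, here is a rough sketch that parses config text of the "name value" form. It assumes the stock digits file holds a single line like tessedit_char_whitelist 0123456789-. (check your own copy, it may differ), and parse_tesseract_config is a hypothetical helper, not a pytesseract function.

```python
def parse_tesseract_config(text):
    """Parse 'name value' lines of a Tesseract config file into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, value = line.partition(" ")
        settings[name] = value.strip()
    return settings


# Assumed stock content of the digits file, and an edited version that
# allows '/' instead of '-' and '.' (useful for dates like 12/03/2019).
stock = "tessedit_char_whitelist 0123456789-.\n"
edited = "tessedit_char_whitelist 0123456789/\n"

print(parse_tesseract_config(stock)["tessedit_char_whitelist"])
print(parse_tesseract_config(edited)["tessedit_char_whitelist"])
```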

Upvotes: 11

thewaywewere

Reputation: 8636

You can restrict the output to digits by passing tessedit_char_whitelist as a config option, as shown below.

ocr_result = pytesseract.image_to_string(image, lang='eng', boxes=False, \
           config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

Hope this helps.

Upvotes: 16
