Edge

Reputation: 2540

PyTesseract not recognizing decimals

This is not truly a duplicate of "How to extract decimal in image with Pytesseract", as those answers did not solve my problem and my use case is different.

I'm using PyTesseract to recognise text in table cells. When it comes to drug doses with decimal points, the OCR fails to recognise the decimal point (.), though it is accurate for everything else. I'm using tesseract v5.0.0-alpha.20200328 on Windows 10.

My pre-processing consists of upscaling by 400% using cubic interpolation, conversion to black and white, dilation and erosion, a morphological close, and blurring. I've tried a fair number of combinations of these (as well as each on its own), and none of them has led to the decimal point being recognised.

I've tried various --psm values as well as a character whitelist. I believe the font is Segoe UI.
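For reference, a minimal sketch of how a page segmentation mode and a character whitelist can be combined into one config string. The helper name `build_tesseract_config` and the whitelist `0123456789.mg` are illustrative assumptions, not the exact values tried above; `tessedit_char_whitelist` is the Tesseract variable for restricting output characters, and `--psm 7` treats the image as a single text line.

```python
def build_tesseract_config(psm, whitelist=None):
    """Build the string passed to pytesseract's `config` argument.

    Hypothetical helper for illustration; it only formats the options.
    """
    parts = [f"--psm {psm}"]
    if whitelist:
        # Restrict recognised characters to the ones listed.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Assumed example values: single text line, digits plus "." and the unit letters.
config = build_tesseract_config(7, "0123456789.mg")
print(config)

# The string would then be used like this (requires pytesseract and an image):
# text = pytesseract.image_to_string(final_image, lang='eng', config=config)
```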

Before processing: [image: cell before pre-processing]

After processing: [image: cell after pre-processing]

PyTesseract output: 25mg »p

Processing code:

import cv2
import numpy as np
import pytesseract

# Load the cell image and upscale it 4x with cubic interpolation
image = cv2.imread('01.png')
upscaled_image = cv2.resize(image, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
bw_image = cv2.cvtColor(upscaled_image, cv2.COLOR_BGR2GRAY)

# Dilate, then erode, with a 2x2 kernel
kernel = np.ones((2, 2), np.uint8)
dilated_image = cv2.dilate(bw_image, kernel, iterations=1)
eroded_image = cv2.erode(dilated_image, kernel, iterations=1)

# Threshold to black and white, then apply a morphological close
thresh = cv2.threshold(eroded_image, 205, 255, cv2.THRESH_BINARY)[1]
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
morph_image = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)

# Bilateral blur, then re-threshold with Otsu's method
blur_image = cv2.threshold(cv2.bilateralFilter(morph_image, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

final_image = blur_image
text = pytesseract.image_to_string(final_image, lang='eng', config='--psm 10')

Upvotes: 0

Views: 1241

Answers (2)

Joe

Reputation: 7121

I had a similar case and was able to increase the number of correctly recognized decimals by using image processing methods and upscaling the image. Yet a small share of the decimals were still not recognized correctly.

The solution I found was to change the language setting for pytesseract:

I was using a non-English setting, but changing the config to lang='eng' fixed all remaining issues.

That might not help with the original question, though, as the setting is already eng.

Upvotes: 0

Deepika Gohrani

Reputation: 11

If you haven't already, have a look at this thread:

https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/xk2ErJnFBQAJ

One fix for many problems is text height. I was facing many issues and couldn't figure out why, but sending Tesseract an image whose letters are the right size seems to solve many of them. Instead of upscaling by an arbitrary percentage, try a scale factor that brings the letters in your image close to 30-40 px tall.
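As a rough sketch of that idea, the resize factor can be derived from the measured letter height rather than picked by hand. The function names and the 35 px default (the middle of the 30-40 px range) are assumptions for illustration; the letter height itself has to be measured separately, for example in an image editor.

```python
def scale_for_text_height(measured_height_px, target_height_px=35.0):
    """Return the resize factor that brings letters to the target height."""
    if measured_height_px <= 0:
        raise ValueError("measured_height_px must be positive")
    return target_height_px / measured_height_px


def resize_to_text_height(image, measured_height_px, target_height_px=35.0):
    """Resize an OpenCV image so its letters end up about target_height_px tall."""
    import cv2  # imported here so the helper above works without OpenCV

    factor = scale_for_text_height(measured_height_px, target_height_px)
    # Cubic interpolation for enlarging, area interpolation for shrinking.
    interpolation = cv2.INTER_CUBIC if factor > 1 else cv2.INTER_AREA
    return cv2.resize(image, None, fx=factor, fy=factor, interpolation=interpolation)


# Letters measured at 10 px tall need a 3.5x upscale to reach 35 px.
print(scale_for_text_height(10))  # 3.5
```

With the question's fixed 4x upscale, letters that start out taller than about 10 px would end up above the 40 px mark, so the factor is worth checking against the actual letter height.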

Also, if your preprocessing somehow turns the "." into a noise-like character, it will be ignored as well.
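To illustrate how preprocessing can damage a small dot, here is a pure-Python sketch that mimics a dilate-then-erode pass (a morphological close) with a 3x3 kernel on dark text over a white background. The tiny test images and helper names are made up for the demonstration; whether the same thing happens in your pipeline depends on how large the dot is after upscaling.

```python
def _filter3x3(image, pick):
    """Apply `pick` (max or min) over each pixel's 3x3 neighbourhood, clipped at the borders."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(pick(window))
        result.append(row)
    return result


def dilate(image):
    # Grows bright regions, so dark text on a white background gets thinner.
    return _filter3x3(image, max)


def erode(image):
    # Grows dark regions, so any dark text that survived gets thicker again.
    return _filter3x3(image, min)


def white_page(size):
    return [[255] * size for _ in range(size)]


def count_dark(image):
    return sum(pixel == 0 for row in image for pixel in row)


# A 2x2 dark "decimal point" on a white 8x8 page.
dot = white_page(8)
for y in (3, 4):
    for x in (3, 4):
        dot[y][x] = 0

# A 4x4 dark blob, standing in for a thicker letter stroke.
stroke = white_page(8)
for y in range(2, 6):
    for x in range(2, 6):
        stroke[y][x] = 0

dot_after = erode(dilate(dot))
stroke_after = erode(dilate(stroke))

print(count_dark(dot), "->", count_dark(dot_after))        # 4 -> 0
print(count_dark(stroke), "->", count_dark(stroke_after))  # 16 -> 16
```

The dot is smaller than the kernel, so the dilation wipes it out and the erosion has nothing left to restore, while the thicker stroke comes back unchanged. Saving the image after each preprocessing step and checking that the dot is still visible is a quick way to rule this out.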

Upvotes: 1
