Reputation: 2540
This is not truly a duplicate of How to extract decimal in image with Pytesseract, as those answers did not solve my problem and my use case is different.
I'm using PyTesseract to recognise text in table cells. When it comes to recognising drug doses with decimal points, the OCR fails to recognise the ".", though it is accurate for everything else. I'm using tesseract v5.0.0-alpha.20200328 on Windows 10.
My pre-processing consists of upscaling by 400% using cubic interpolation, conversion to black and white, dilation and erosion, morphology, and blurring. I've tried many combinations of these (as well as each on its own), and none of them has led to the "." being recognised.
I've tried various --psm values as well as a character whitelist. I believe the font is Segoe UI.
PyTesseract output: 25mg »p
Processing code:
import cv2
import numpy as np
import pytesseract

image = cv2.imread('01.png')
# Upscale 4x (400%) with cubic interpolation
upscaled_image = cv2.resize(image, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
# Convert to grayscale (binarisation happens at the threshold step below)
bw_image = cv2.cvtColor(upscaled_image, cv2.COLOR_BGR2GRAY)
# Dilate, then erode, with a 2x2 kernel
kernel = np.ones((2, 2), np.uint8)
dilated_image = cv2.dilate(bw_image, kernel, iterations=1)
eroded_image = cv2.erode(dilated_image, kernel, iterations=1)
# Binarise with a fixed threshold of 205
thresh = cv2.threshold(eroded_image, 205, 255, cv2.THRESH_BINARY)[1]
# Morphological closing with a 3x3 rectangular kernel
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
morph_image = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)
# Bilateral filter, then binarise again with Otsu's threshold
blur_image = cv2.threshold(cv2.bilateralFilter(morph_image, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
final_image = blur_image
# Note: --psm 10 tells Tesseract to treat the image as a single character
text = pytesseract.image_to_string(final_image, lang='eng', config='--psm 10')
Upvotes: 0
Views: 1241
Reputation: 7121
I had a similar case and was able to increase the number of correctly recognised decimals by using image processing methods and upscaling the image. Still, a small share of the decimals were not recognised correctly.
The solution I found was to change the language setting for pytesseract: I was using a non-English setting, but changing the config to lang='eng' fixed all remaining issues.
That might not help with the original question, though, as the setting there is already eng.
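A rough sketch of how the effect of the language setting could be checked: run the same image through several languages and see which outputs contain a decimal dose. The helper names, the regular expression, the "deu" language code and the --psm 7 (single text line) config are all assumptions made for illustration; only image_to_string and its lang and config arguments come from the question's code.

```python
import re

# Hypothetical pattern: digits, a decimal point, digits, then the unit "mg"
DECIMAL_DOSE = re.compile(r"\d+\.\d+\s*mg")


def has_decimal_dose(text):
    """Return True if the OCR text contains something like '2.5mg'."""
    return DECIMAL_DOSE.search(text) is not None


def ocr_by_language(image, languages=("eng", "deu"), config="--psm 7"):
    """Run Tesseract once per language and return {language: text}.

    Each language's traineddata must be installed; "deu" is only an example.
    """
    import pytesseract  # imported here so has_decimal_dose works without Tesseract

    return {
        lang: pytesseract.image_to_string(image, lang=lang, config=config)
        for lang in languages
    }
```

With the question's final_image, something like {lang: has_decimal_dose(text) for lang, text in ocr_by_language(final_image).items()} would show which language settings, if any, keep the decimal point.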
Upvotes: 0
Reputation: 11
If you haven't checked this already, see this thread:
https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/xk2ErJnFBQAJ
One fix for many problems is text height. I was facing many issues and couldn't figure out why, but it seems that sending Tesseract an image whose letters are the right size solves many of them. Instead of upscaling by an arbitrary percentage, try a scale factor that makes the letters in your image close to 30-40 px tall.
Also, if your preprocessing somehow turns the "." into a noise-like character, it will be ignored too.
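A minimal sketch of picking the scale factor from the text height rather than using a fixed percentage. It assumes a grayscale crop of a single table cell with dark text on a light background, and it measures the full line height (ascenders and descenders included), which is only an approximation of letter height. The function names, the darkness threshold of 128 and the 35 px target (the middle of the 30-40 px range) are illustrative choices.

```python
import numpy as np


def estimate_text_height(gray, dark_threshold=128):
    """Height in pixels of the tallest run of consecutive rows containing dark pixels."""
    rows_with_ink = (gray < dark_threshold).any(axis=1)
    best = run = 0
    for has_ink in rows_with_ink:
        run = run + 1 if has_ink else 0
        best = max(best, run)
    return best


def scale_for_target_height(current_height, target_height=35):
    """Scale factor that brings text of current_height px to target_height px."""
    if current_height <= 0:
        raise ValueError("no text rows found; cannot compute a scale factor")
    return target_height / current_height


def resize_to_text_height(gray, target_height=35):
    """Resize the image so its text is roughly target_height px tall."""
    import cv2  # imported here so the helpers above only need NumPy

    scale = scale_for_target_height(estimate_text_height(gray), target_height)
    return cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
```

For example, a cell whose text is 10 px tall would be scaled by 3.5 rather than by a fixed factor of 4.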
Upvotes: 1