Reputation: 1
I'm new to the OCR world and I have documents with numbers to analyse with Python, OpenCV and pytesseract. The files I received are PDFs and the numbers are not text, so I converted the first page to JPEG with this:
from pdf2image import convert_from_path
# Render only the first page of the PDF at 600 dpi and save it as a JPEG
first_page = convert_from_path(path_to_pdf, dpi=600, first_page=1, last_page=1)
first_page[0].save(TEMP_FOLDER + 'temp.jpg', 'JPEG')
Then, the images look like this: I still have some noise around the digits.
I tried to select the "black color only" with this:
import cv2
import numpy as np

# Keep only near-black pixels: any hue/saturation, but a low value (V) in HSV
img_hsv = cv2.cvtColor(img_raw, cv2.COLOR_BGR2HSV)
img_changing = cv2.cvtColor(img_raw, cv2.COLOR_RGB2GRAY)
low_color = np.array([0, 0, 0])
high_color = np.array([180, 255, 30])
blackColorMask = cv2.inRange(img_hsv, low_color, high_color)

# Invert, keep only the masked (black) pixels, then invert back to black digits on white
img_inversion = cv2.bitwise_not(img_changing)
img_black_filtered = cv2.bitwise_and(img_inversion, img_inversion, mask=blackColorMask)
img_final_inversion = cv2.bitwise_not(img_black_filtered)
So, with this code, my image looks like this:
Even with cv2.blur, I don't reach 75% of images fully analysed: for at least 25% of the images, pytesseract misses one or more digits. Is that normal? Do you have any ideas on how to maximize the success rate?
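For completeness, the OCR step itself is essentially just a plain pytesseract call on the filtered image; a simplified sketch of what I run (the blur kernel size is only an example value):

import cv2
import pytesseract

# Light blur to soften the remaining speckles (kernel size is an example value)
img_blurred = cv2.blur(img_final_inversion, (3, 3))

# Plain Tesseract call with default settings
text = pytesseract.image_to_string(img_blurred, lang='eng')
print(text)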
Thanks
Upvotes: 0
Views: 263
Reputation: 11849
Your attempt to process a field entry was thwarted by "artifacts"; see the upper pair for my best result with your coloured source.
The usual advice is to use greyscale, but in this case that makes matters worse, as there is background chatter.
You were right to attempt thresholding, as that will produce clearer results; however, Tesseract is prone to odd line and whitespace insertions when the characters do not form words.
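As an illustration, a plain Otsu threshold on the greyscale render usually gives a clean black-on-white image to feed Tesseract. This is a rough sketch, not your exact pipeline; the file names are placeholders:

import cv2

# Load the rendered page and convert to greyscale
img = cv2.imread('temp.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu chooses the threshold automatically; digits come out black on a white background
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite('temp_binary.jpg', binary)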
I suggested you double-check whether there was any vector data in the file, and it appears you uncovered an entry (an annotation?) that matched the data field.
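If you want to verify that yourself, something along these lines will dump any text layer and annotations on the first page. This is only a sketch: PyMuPDF is just one example library (pdfminer would do as well) and the path is a placeholder:

import fitz  # PyMuPDF

doc = fitz.open('input.pdf')  # placeholder path
page = doc[0]

# Any real (vector) text layer on the page
print(page.get_text())

# Annotation entries, which can duplicate the printed field
for annot in page.annots():
    print(annot.type, annot.info.get('content'))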
Upvotes: 0
Reputation: 1689
Whenever you see that Tesseract is missing a character or digit, think about page segmentation modes. If a character was read but recognised incorrectly, that is a recognition issue instead.
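A quick way to tell the two cases apart is to dump Tesseract's word-level output together with its confidences, for example with image_to_data. This is a rough sketch; image is assumed to be your preprocessed page:

import pytesseract
from pytesseract import Output

# Word-level results: a missed digit never shows up here at all,
# while a misread one shows up, usually with a low confidence
data = pytesseract.image_to_data(image, lang='eng', output_type=Output.DICT)
for word, conf in zip(data['text'], data['conf']):
    if word.strip():
        print(word, conf)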
OCR engines first split up the text in the input image; this splitting is called page segmentation. Only then do they try to recognize the text. Tesseract supports 14 page segmentation modes (0-13), as follows:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
For your case, the best option would be to treat the image as a single uniform block of text (--psm 6) so that no digits are missed, and then restrict the output to digits only to get a better result. Your code should look like this:
import pytesseract

# PSM 6 = single uniform block of text; the whitelist restricts output to digits
text = pytesseract.image_to_string(image, lang='eng',
                                   config='--psm 6 -c tessedit_char_whitelist=0123456789')
print(text)
Output:
1821293045013
Upvotes: 1