Romain Che

Reputation: 1

Struggling to analyse numbers on documents with OpenCV and pytesseract

I'm new to OCR and I have documents with numbers to analyse with Python, OpenCV and pytesseract. The files I received are PDFs and the numbers are not stored as text, so I converted the first page to JPG with this:

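from pdf2image import convert_from_path
# Render only page 1 of the PDF at 600 dpi, then save it as a JPEG.
first_page = convert_from_path(path__to_pdf, dpi=600, first_page=1, last_page=1)
first_page[0].save(TEMP_FOLDER+'temp.jpg', 'JPEG')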

The resulting images look like this. There is still some noise around the digits.

enter image description here

I tried to select the "black color only" with this :

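import cv2
import numpy as np

# img_raw is assumed to come from cv2.imread, which returns BGR channel order.
img_hsv = cv2.cvtColor(img_raw, cv2.COLOR_BGR2HSV)
img_changing = cv2.cvtColor(img_raw, cv2.COLOR_BGR2GRAY)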

low_color = np.array([0, 0, 0])
high_color = np.array([180, 255, 30])

blackColorMask = cv2.inRange(img_hsv, low_color, high_color)

img_inversion = cv2.bitwise_not(img_changing)
img_black_filtered = cv2.bitwise_and(img_inversion, img_inversion, mask = blackColorMask)
img_final_inversion = cv2.bitwise_not(img_black_filtered)

With this code, my image looks like this:

enter image description here

Even with cv2.blur, fewer than 75% of the images are read completely: for at least 25% of them, pytesseract misses one or more digits. Is that normal? Do you have any ideas for improving the success rate?
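To compare preprocessing variants on equal terms, the "fully analysed" rate can be computed against a set of hand-checked values. The snippet below is only a sketch: `full_match_rate` and the sample lists are hypothetical, not part of my pipeline.

```python
def full_match_rate(expected, recognised):
    """Fraction of images whose OCR output equals the expected digits exactly."""
    if len(expected) != len(recognised):
        raise ValueError("expected and recognised must have the same length")
    if not expected:
        return 0.0
    hits = sum(1 for want, got in zip(expected, recognised) if want == got)
    return hits / len(expected)


# Hypothetical ground truth and OCR results for four images.
truth = ["1821", "2930", "4501", "3007"]
ocr = ["1821", "293", "4501", "3007"]  # second image lost a digit

rate = full_match_rate(truth, ocr)
print(f"{rate:.0%} of images fully recognised")
```

An image counts as a success only when every digit matches, which is the criterion I am using above.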

Thanks

Upvotes: 0

Views: 263

Answers (2)

K J

Reputation: 11849

Your attempt to process a field entry was thwarted by "artifacts". See the upper pair for my best result with your coloured source.

enter image description here

The normal advice is to use greyscale, but in this case that makes matters worse because there is background chatter.

enter image description here

You were right to attempt thresholding, as that will produce clearer results. However, Tesseract is prone to inserting odd line breaks and white space when the characters are not words.
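One way to cope with those stray line breaks and spaces is to strip everything that is not a digit from the raw output before using it. This is a minimal sketch: `digits_only` is a hypothetical helper and the `raw` string is made-up sample output, not the result from your image.

```python
import re


def digits_only(ocr_text):
    """Drop everything except ASCII digits from raw OCR output."""
    return re.sub(r"[^0-9]", "", ocr_text)


# Made-up OCR output with inserted spaces, newlines and a trailing form feed.
raw = "18 21\n2930 45\n013\n\x0c"
print(digits_only(raw))
```

This only cleans up spacing. It cannot restore a digit that was never recognised.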

enter image description here

I suggested you double-check whether there was any vector data in the file, and it appears you uncovered an entry (an annotation?) that matched the data field.

Upvotes: 0

Esraa Abdelmaksoud

Reputation: 1689

Whenever you see that Tesseract is missing a character or digit, think about page segmentation modes. If the character is not correct but was read, it is a recognition issue.

OCR engines first split the text in the input image into regions, which is called page segmentation, and then try to recognize the text in each region. Tesseract supports 14 page segmentation modes, numbered 0 to 13:

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
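The mode number is passed to Tesseract with the `--psm` flag, and pytesseract forwards it through its `config` argument. As a rough sketch, a small helper can build that string and reject mode numbers outside the table above. `build_tesseract_config` is a hypothetical name, not part of pytesseract.

```python
def build_tesseract_config(psm, whitelist=None):
    """Build a config string for pytesseract from a PSM number and optional whitelist."""
    if not isinstance(psm, int) or not 0 <= psm <= 13:
        raise ValueError("psm must be an integer from 0 to 13")
    parts = [f"--psm {psm}"]
    if whitelist:
        # tessedit_char_whitelist restricts which characters may be output.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


config = build_tesseract_config(6, "0123456789")
print(config)
```

The returned string can then be passed as `config=config` to `pytesseract.image_to_string`.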

For your case, the best solution would be treating your image as a block to avoid missing any digits. Then, restrict the output to digits only to get a better result. Your code should be like this:

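import pytesseract

# image is the preprocessed picture (a PIL image or a NumPy array).
# --psm 6 treats it as one uniform block; the whitelist limits output to digits.
text = pytesseract.image_to_string(image, lang='eng',
                                   config='--psm 6 -c tessedit_char_whitelist=0123456789')
print(text)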

Output:

1821293045013
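If a single mode still drops digits on some images, one option is to run several modes and keep the digit string that most of them agree on. This is a heuristic of my own rather than a Tesseract feature, and `vote_on_digits` and `fake_ocr` are hypothetical names. The OCR call is passed in as a function so the sketch runs without an image.

```python
import re
from collections import Counter


def vote_on_digits(run_ocr, psms=(6, 7, 8, 13)):
    """Run OCR once per mode and return (most common digit string, vote count)."""
    results = []
    for psm in psms:
        config = f"--psm {psm} -c tessedit_char_whitelist=0123456789"
        text = run_ocr(config)
        # Keep digits only so that spacing differences do not split the vote.
        results.append(re.sub(r"[^0-9]", "", text))
    winner, votes = Counter(results).most_common(1)[0]
    return winner, votes


# Stand-in for pytesseract: pretends that mode 8 misses the last digit.
def fake_ocr(config):
    return "182129304501\n" if "--psm 8" in config else "1821293045013\n"


result = vote_on_digits(fake_ocr)
print(result)
```

With the real library, the callable would be something like `lambda cfg: pytesseract.image_to_string(image, lang='eng', config=cfg)`.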

Upvotes: 1
