NFeruch - FreePalestine

Reputation: 1145

python preprocess image of table with multiple colors using cv2 and pytesseract

I'm having trouble preprocessing an image of a table so that the OCR reads it correctly.


The multiple background colors are messing up the OCR, and I'm not sure which preprocessing steps to take so that the text is interpreted correctly. It needs to work for other colors too, since I have more tables with accents of blue, green, brown, etc.

Here is my code so far, in a Jupyter notebook:

import pytesseract as tess
import cv2
import matplotlib.pyplot as plt

img = cv2.imread("table_img.png")
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
plt.imshow(img, interpolation='nearest')
plt.show()
print(tess.image_to_string(img))

How can I preprocess the image so that it's read correctly?

Upvotes: 1

Views: 506

Answers (1)

chrslg

Reputation: 13336

That's not a simple question. There are whole PhD dissertations about this.

But, well, in your case, if all your images are of this kind, an adaptive threshold (one that behaves differently where the background is "greyish" and where it is purple) should do.

Otsu is not at all suitable here. Otsu's assumption is that the pixels of your image fall into 2 classes, and that any variation within each class is just noise. It looks for the threshold that maximizes the between-class variance, which is the same as minimizing the within-class variance. In other words, it finds the threshold that best separates the histogram of all pixel values into 2 distinct peaks.

But that assumption is precisely what does not hold for your image. Your image has mainly 3 colors: black, grayish, and purple. There is no way to claim that purple is just a form of gray with noise. Noise is not what explains the difference between the backgrounds.

So, what you need is an adaptive threshold, one that sets the threshold locally.

Like this:

img = cv2.adaptiveThreshold(img,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY,11,5)

Note that 11 is the size of the area over which the average background value is computed for the threshold. So black pixels are the ones that are darker than the (Gaussian-weighted) mean of that 11x11 area, minus a constant; the size has to be odd, since the area is centered on the pixel. And 5 is that constant: what is subtracted from the mean to get the threshold. It should be non-zero. With 0, in a flat "white" area the threshold is the mean, so every pixel sits exactly at the threshold, not over it, and comes out black (or, with noise, randomly black or white). In an area with text, the mean lies between background and text, so text is under the threshold and background is over it. That is why, on your image, with 0 you get an essentially black image with some white zones containing black text.

So, that number should be non-zero. It should be big enough that noise in white areas does not create too many black pixels (though your images don't seem noisy), and small enough that, whatever the background color is, the difference between background and text is always much bigger than this 5.

[Image: result of the adaptive threshold]

Upvotes: 2
