Is there a better way to separate the writing from the background?

Question

I am working on a project where I should apply and OCR on some documents.
The first step is to threshold the image and let only the writing (whiten the background).

Example of an input image: (For the GDPR and privacy reasons, this image is from the Internet)

Here is my code:

import cv2
import numpy as np


image = cv2.imread('b.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
h = image.shape[0]
w = image.shape[1]
for y in range(0, h):
    for x in range(0, w):
        if image[y, x] >= 120:
            image[y, x] = 255
        else:
            image[y, x] = 0
cv2.imwrite('output.jpg', image)

Here is the result that I got:

When I applied pytesseract to the output image, the results were not satisfying (I know that an OCR is not perfect). Although I tried to adjust the threshold value (in this code it is equal to 120), the result was not as clear as I wanted.

Is there a way to make a better threshold in order to only keep the writing in black and whiten the rest?

Ishara Dayarathna · Accepted Answer

You can use adaptive thresholding. From documentation :

In this, the algorithm calculate the threshold for a small regions of the image. So we get different thresholds for different regions of the same image and it gives us better results for images with varying illumination.

import numpy as np
import cv2



image = cv2.imread('b.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
image = cv2.medianBlur(image ,5)

th1 = cv2.adaptiveThreshold(image,255,cv2.ADAPTIVE_THRESH_MEAN_C,\
            cv2.THRESH_BINARY,11,2)
th2 = cv2.adaptiveThreshold(image,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,\
            cv2.THRESH_BINARY,11,2)
cv2.imwrite('output1.jpg', th1 )
cv2.imwrite('output2.jpg', th2 )

Is there a better way to separate the writing from the background?

Answers (2)

Related Questions