Pytesseract : Unable to extract date from image

I am trying to extract the date from an image (image here). The date is located at the bottom-right corner, so I have also tried cropping the image first and then extracting the date, but neither approach works.

import cv2
import pytesseract
import matplotlib.pyplot as plt

img = cv2.imread("datepic.jpg")
date = pytesseract.image_to_string(img,config='--psm 10 -c tessedit_char_whitelist=0123456789')
date

##cropping the image to get only the date part
from PIL import Image
img = Image.open("datepic.jpg")
left =320
right = 460
top = 330
bottom = 350
m1 = img.crop((left, top, right, bottom))
plt.imshow(m1)
date = pytesseract.image_to_string(m1, config='-psm 9 tessedit_char_whitelist=0123456789')
date

I have also played with the psm value, but none of the values gave the desired result.

Upvotes: 1

Answers (1)

Ahx

Reputation: 7985

Using the information that the date is located at the bottom-right corner, we can estimate its position. Next, we upsample the cropped region and perform color segmentation so that the digits become more visible to both the human eye and Tesseract. Finally, we apply an adaptive threshold to isolate the characters and run OCR.

1. Upsampling and performing color-segmentation: We load the image and crop the date part by estimating its coordinates. We then upsample the cropped region (using cv2.resize) to make it more visible. To perform color segmentation, we convert the cropped region to HSV, define lower/upper ranges, and call cv2.inRange to obtain a binary mask.
1. Extracting the date part: After obtaining the binary mask, we use it with cv2.bitwise_and to remove the background and separate the date part from the rest of the image. The bitwise AND operation is highly useful for defining a ROI in HSV-segmented images.
1. Applying adaptive-threshold: After extracting the date from the image, we will see a dense yellow area that makes the ROI impossible to read, so simple thresholding methods are not useful. We use cv2.adaptiveThreshold to make the features detectable by Tesseract. Note that the blockSize and C parameters can vary from image to image.
1. Performing OCR: After isolating only the features from the ROI, we don't need to set a page-segmentation mode. We simply call image_to_string without any configuration and get the result.

Steps

Estimating the date position:
- If you divide the width into 5 equal parts, you need the last two parts horizontally and, vertically, a thin strip (30 pixels here) at the bottom of the image.
- Upsampling this crop by a factor of 5 makes it readable and clear. However, the dense yellow area is still a blocking artifact for OCR.
- ```
img = cv2.imread("QqLso.jpg")
(h, w) = img.shape[:2]
crp = img[h-30:h, int((3*w)/5):w]
crp = cv2.resize(crp, (0, 0), fx=5, fy=5)
```
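The crop arithmetic above can be sanity-checked without loading an image. The sketch below is only an illustration: the helper names `date_crop_box` and `upsampled_size` are made up, and the 480x360 image size is an assumption (guessed from the crop coordinates in the question), not the size of the actual file.

```python
def date_crop_box(h, w, strip=30, parts=5, keep=2):
    """Return (top, bottom, left, right) of the bottom-right strip.

    With the default arguments this mirrors img[h-30:h, int((3*w)/5):w].
    """
    top = h - strip
    left = int(((parts - keep) * w) / parts)
    return top, h, left, w


def upsampled_size(box, factor=5):
    """Return (height, width) of the crop after resizing by `factor`."""
    top, bottom, left, right = box
    return (bottom - top) * factor, (right - left) * factor


# Hypothetical 480x360 image (width x height); not the real file's size.
box = date_crop_box(h=360, w=480)
print(box)                  # (330, 360, 288, 480)
print(upsampled_size(box))  # (150, 960)
```

If the date sits higher or further left in another image, `strip` and `keep` are the two values to adjust.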

Mask generated from color-segmentation

The idea is to remove the area which does not contain date information.

hsv = cv2.cvtColor(crp, cv2.COLOR_BGR2HSV)
lwr = np.array([0, 102, 0])
upr = np.array([179, 255, 255])
msk = cv2.inRange(hsv, lwr, upr)
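
To make the mask rule concrete, here is a pure-Python sketch of the per-pixel test that cv2.inRange applies: a pixel becomes 255 only if every channel lies within the inclusive bounds, otherwise 0. The function name `in_range_pixel` and the sample HSV values are made up for illustration. With the bounds used here, only the saturation channel actually filters anything: pixels with S below 102 (40% of 255) are dropped.

```python
LWR = (0, 102, 0)      # same lower bound as in the answer
UPR = (179, 255, 255)  # same upper bound as in the answer


def in_range_pixel(hsv, lwr=LWR, upr=UPR):
    """Return 255 if every channel is inside [lwr, upr] (inclusive), else 0."""
    inside = all(lo <= ch <= hi for ch, lo, hi in zip(hsv, lwr, upr))
    return 255 if inside else 0


# Hypothetical sample pixels in OpenCV's 8-bit HSV scale.
print(in_range_pixel((30, 200, 220)))  # saturated pixel is kept: 255
print(in_range_pixel((30, 40, 220)))   # washed-out pixel is dropped: 0
print(in_range_pixel((0, 102, 0)))     # bounds are inclusive: 255
```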

Extracting date part using bitwise_and
- Now we have the date information, with an artifact still blocking the day and month digits.
- ```
res = 255 - cv2.bitwise_and(crp, crp, mask=msk)
```
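
As a rough illustration of what `255 - cv2.bitwise_and(crp, crp, mask=msk)` does to a single pixel: where the mask is zero the pixel is blanked to black and then inverted to white, and where the mask is set the original color is kept and inverted. The helper `mask_and_invert` and the sample pixel values are hypothetical, and the sketch assumes 8-bit channels.

```python
def mask_and_invert(pixel, mask_value):
    """Per-pixel equivalent of 255 - bitwise_and(pixel, pixel, mask)."""
    # ANDing a value with itself leaves it unchanged; the mask decides
    # whether the pixel survives or is zeroed out.
    kept = pixel if mask_value != 0 else (0, 0, 0)
    return tuple(255 - ch for ch in kept)


# Hypothetical BGR pixels.
print(mask_and_invert((0, 200, 255), 255))  # kept, then inverted: (255, 55, 0)
print(mask_and_invert((90, 90, 90), 0))     # masked out, becomes white: (255, 255, 255)
```

So after this step the background is white and the retained region carries inverted colors.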

Applying adaptive-threshold

The artifact is partially gone and the image is suitable for OCR.

crp_img = cv2.cvtColor(res, cv2.COLOR_HSV2BGR)
crp_gry = cv2.cvtColor(crp_img, cv2.COLOR_BGR2GRAY)
thr = cv2.adaptiveThreshold(crp_gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 31, 9)
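
To see why blockSize and C matter, the following is a simplified pure-Python sketch of mean adaptive thresholding with inverted output: each pixel is compared against the mean of its blockSize x blockSize neighborhood minus C, and becomes 255 when it is at or below that local threshold. The function name `adaptive_mean_inv` is made up, the sketch assumes replicated borders, and it ignores OpenCV's internal integer rounding, so borderline pixels may differ from cv2.adaptiveThreshold.

```python
def adaptive_mean_inv(gray, block_size, c, max_value=255):
    """Simplified mean adaptive threshold with inverted binary output.

    `gray` is a list of rows of 0..255 values; `block_size` must be odd.
    """
    h, w = len(gray), len(gray[0])
    r = block_size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    # Clamp coordinates so edge pixels are replicated.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += gray[yy][xx]
            threshold = total / (block_size * block_size) - c
            row.append(max_value if gray[y][x] <= threshold else 0)
        out.append(row)
    return out


# Toy 3x3 image: one dark pixel on a bright background.
toy = [[100, 100, 100],
       [100,  50, 100],
       [100, 100, 100]]
print(adaptive_mean_inv(toy, block_size=3, c=2))
# [[0, 0, 0], [0, 255, 0], [0, 0, 0]]
```

Only the pixel that is clearly darker than its surroundings turns white, which is why this approach copes with uneven backgrounds better than a single global threshold.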

Result of tesseract-ocr
- ```
01/02/2021.14:213
```
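
The raw OCR string contains more than the date, so a small post-processing step may help. This is a minimal sketch, not part of the original answer: the helper `extract_date` is hypothetical, and the day/month/year order is an assumption, since `01/02/2021` is ambiguous on its own.

```python
import re
from datetime import datetime

# Two digits, slash, two digits, slash, four digits.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def extract_date(text):
    """Return the first dd/mm/yyyy date found in `text`, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    try:
        # Assumption: day comes first. Use "%m/%d/%Y" for month-first stamps.
        return datetime.strptime(match.group(), "%d/%m/%Y").date()
    except ValueError:
        # The digits matched the pattern but do not form a valid date.
        return None


print(extract_date("01/02/2021.14:213"))  # 2021-02-01
```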

Code:

import cv2
import numpy as np
import pytesseract

# Estimate the position
img = cv2.imread("QqLso.jpg")
(h, w) = img.shape[:2]
crp = img[h-30:h, int((3*w)/5):w]

# Up-sample
crp = cv2.resize(crp, (0, 0), fx=5, fy=5)

# Create binary mask
hsv = cv2.cvtColor(crp, cv2.COLOR_BGR2HSV)
lwr = np.array([0, 102, 0])
upr = np.array([179, 255, 255])
msk = cv2.inRange(hsv, lwr, upr)

# Remove background
res = 255 - cv2.bitwise_and(crp, crp, mask=msk)

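# Adaptive-threshold
# Note: res is already a BGR image (crp comes from cv2.imread); the HSV2BGR
# call is kept as-is because it is part of the recipe that gave the result above.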
crp_img = cv2.cvtColor(res, cv2.COLOR_HSV2BGR)
crp_gry = cv2.cvtColor(crp_img, cv2.COLOR_BGR2GRAY)
thr = cv2.adaptiveThreshold(crp_gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 31, 9)

# OCR
txt = pytesseract.image_to_string(thr)
print(txt)

# Display the thresholded image; press any key to close the window
cv2.imshow("thr", thr)
cv2.waitKey(0)

To find the lower and upper boundaries of the mask, you may find this useful: HSV-Threshold-script
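
If you only need a quick starting point for the bounds, one option is to convert a sampled color to OpenCV's 8-bit HSV scale (H in 0..179, S and V in 0..255) and check where it falls. The sketch below uses the standard-library `colorsys` module; the helper `rgb_to_opencv_hsv` is made up for illustration, and its rounding may differ by one unit from cv2.cvtColor. Note that it takes RGB order, whereas cv2.imread returns BGR.

```python
import colorsys


def rgb_to_opencv_hsv(r, g, b):
    """Convert 8-bit RGB to approximate OpenCV-style HSV (H 0..179)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # OpenCV stores hue as degrees / 2 so that it fits in 8 bits.
    return int(round(h * 180)) % 180, int(round(s * 255)), int(round(v * 255))


print(rgb_to_opencv_hsv(255, 255, 0))    # pure yellow: (30, 255, 255)
print(rgb_to_opencv_hsv(128, 128, 128))  # mid gray:    (0, 0, 128)
```

With the bounds used above, the yellow sample (S = 255) would pass the mask, while the gray sample (S = 0) would be removed.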

Upvotes: 2
