Md Zain Ahmed

Reputation: 11

Pytesseract : Unable to extract date from image

I am trying to extract the date from an image (image here). The date is located at the bottom right corner, so I have also tried cropping the image first and then extracting the date, but neither approach works.

import cv2
import pytesseract
import matplotlib.pyplot as plt

img = cv2.imread("datepic.jpg")
date = pytesseract.image_to_string(img,config='--psm 10 -c tessedit_char_whitelist=0123456789')
date

##cropping the image to get only the date part
from PIL import Image
img = Image.open("datepic.jpg")
left =320
right = 460
top = 330
bottom = 350
m1 = img.crop((left, top, right, bottom))
plt.imshow(m1)
date = pytesseract.image_to_string(m1, config='-psm 9 tessedit_char_whitelist=0123456789')
date

I have also tried different psm values, but none of them gave the desired result.

Upvotes: 1

Views: 1286

Answers (1)

Ahx

Reputation: 7985

Since the date is located at the bottom right corner, we can estimate its position. Next, we upsample the cropped region and apply color segmentation to make the digits more visible to both the human eye and Tesseract. Finally, we apply an adaptive threshold and run OCR on the result.

    1. Upsampling and performing color segmentation: We load the image and crop the date part by estimating its coordinates. We upsample the cropped region (using cv2.resize) to make it visible. To perform color segmentation, we convert the cropped region to HSV format, define lower/upper ranges, and use cv2.inRange to obtain a binary mask.
    2. Extracting the date part: After obtaining the binary mask, we use it with cv2.bitwise_and to remove the background and separate the date part from the rest of the image. The bitwise AND operation is very useful for defining a ROI in HSV color-segmented images.
    3. Applying adaptive threshold: After extracting the date from the image, we will see a dense yellow area that makes the ROI impossible to read, so simple thresholding methods are not useful. We will use cv2.adaptiveThreshold to make the features detectable by Tesseract. Note that the blockSize and C parameters can vary from image to image.
    4. Performing OCR: After isolating only the features from the ROI, we don't need to set any page segmentation mode. We simply call image_to_string without any configuration and get the result.
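
Step 4 runs Tesseract with its defaults. If a configuration is wanted anyway, for example to restrict the recognized characters as the question attempts, the options need the spelling Tesseract's command line expects: `--psm N` for the page segmentation mode (Tesseract 4 and later) and `-c name=value` for each variable. The sketch below only assembles such a string. `build_tesseract_config` is a hypothetical helper, not part of pytesseract, and the whitelist that includes `/`, `:` and `.` is an assumption based on the date format shown in this answer. Whether the whitelist is honored may depend on the Tesseract version and engine.

```python
def build_tesseract_config(psm=None, whitelist=None):
    """Assemble a string for pytesseract's `config` argument.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    parts = []
    if psm is not None:
        # Double dash: page segmentation mode (7 = treat image as one text line)
        parts.append(f"--psm {int(psm)}")
    if whitelist:
        # Each Tesseract variable is passed as `-c name=value`
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


cfg = build_tesseract_config(psm=7, whitelist="0123456789/:.")
print(cfg)

# Usage, assuming `thr` is the thresholded image produced by step 3:
#   txt = pytesseract.image_to_string(thr, config=cfg)
```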

Steps

  • Estimating the date position:

    • If you divide the width into 5 equal parts, you need the last two parts, and vertically only a thin strip at the bottom of the image (the last 30 rows in the code below):

    • If we upsample the image:

    • Now the image is readable and clear. However, the dense yellow area is an artifact that blocks OCR.

    • img = cv2.imread("QqLso.jpg")
      (h, w) = img.shape[:2]
      crp = img[h-30:h, int((3*w)/5):w]
      crp = cv2.resize(crp, (0, 0), fx=5, fy=5)
      
  • Mask generated from color-segmentation

    • The idea is to remove the area which does not contain date information.

    • hsv = cv2.cvtColor(crp, cv2.COLOR_BGR2HSV)
      lwr = np.array([0, 102, 0])
      upr = np.array([179, 255, 255])
      msk = cv2.inRange(hsv, lwr, upr)
      
  • Extracting date part using bitwise_and

    • Now we have the date information, with an artifact still blocking the day and month digits.

    • res = 255 - cv2.bitwise_and(crp, crp, mask=msk)
      
      
  • Applying adaptive-threshold

    • The artifact is partially gone and the image is suitable for OCR.

    • crp_img = cv2.cvtColor(res, cv2.COLOR_HSV2BGR)
      crp_gry = cv2.cvtColor(crp_img, cv2.COLOR_BGR2GRAY)
      thr = cv2.adaptiveThreshold(crp_gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY_INV, 31, 9)
      
  • Result of tesseract-ocr

    • 01/02/2021.14:213
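
The OCR output contains the date followed by extra characters, so a small post-processing step can isolate the date itself. A minimal sketch, assuming the date always appears as two digits, a slash, two digits, a slash, and four digits. `extract_date` is a hypothetical helper, and the pattern is an assumption drawn from the single result above; it does not decide which field is the day and which is the month.

```python
import re

# Assumed layout: NN/NN/NNNN, as in the OCR result shown above
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")


def extract_date(ocr_text):
    """Return the first NN/NN/NNNN substring in the OCR text, or None."""
    match = DATE_PATTERN.search(ocr_text)
    return match.group(0) if match else None


print(extract_date("01/02/2021.14:213"))  # 01/02/2021
```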
      

Code:


import cv2
import numpy as np
import pytesseract

# Estimate the position
img = cv2.imread("QqLso.jpg")
(h, w) = img.shape[:2]
crp = img[h-30:h, int((3*w)/5):w]

# Up-sample
crp = cv2.resize(crp, (0, 0), fx=5, fy=5)

# Create binary mask
hsv = cv2.cvtColor(crp, cv2.COLOR_BGR2HSV)
lwr = np.array([0, 102, 0])
upr = np.array([179, 255, 255])
msk = cv2.inRange(hsv, lwr, upr)

# Remove background
res = 255 - cv2.bitwise_and(crp, crp, mask=msk)

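# Adaptive-threshold
# Note: res holds inverted BGR values; the HSV2BGR call below reinterprets
# its channels as HSV before the grayscale conversion.
crp_img = cv2.cvtColor(res, cv2.COLOR_HSV2BGR)
crp_gry = cv2.cvtColor(crp_img, cv2.COLOR_BGR2GRAY)
thr = cv2.adaptiveThreshold(crp_gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 31, 9)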

# OCR
txt = pytesseract.image_to_string(thr)
print(txt)

# display
cv2.imshow("thr", thr)
cv2.waitKey(0)
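
To see what the `blockSize` and `C` arguments (31 and 9 in the code above) control, the mean variant of adaptive thresholding can be approximated in plain NumPy: each pixel is compared with the mean of its `blockSize` x `blockSize` neighborhood minus `C`. This is an illustrative sketch, not OpenCV's implementation. `adaptive_mean_inv` is a hypothetical name, edge replication is assumed at the border, and rounding may differ slightly from `cv2.adaptiveThreshold`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def adaptive_mean_inv(gray, block_size, c):
    """Inverse binary threshold against (local mean - c)."""
    if block_size < 3 or block_size % 2 != 1:
        raise ValueError("block_size must be an odd number >= 3")
    pad = block_size // 2
    # Replicate edge pixels so every pixel has a full neighborhood
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    windows = sliding_window_view(padded, (block_size, block_size))
    local_mean = windows.mean(axis=(-2, -1))
    # Pixels at or below (local mean - c) become white, the rest black
    return np.where(gray <= local_mean - c, 255, 0).astype(np.uint8)


# Tiny example: one dark pixel on a bright background
img = np.full((5, 5), 200, dtype=np.uint8)
img[2, 2] = 0
out = adaptive_mean_inv(img, block_size=3, c=9)
print(out)
```

A larger `block_size` averages over a wider area, and a larger `c` requires a pixel to be further below its surroundings before it is kept, which is why both values tend to need tuning per image.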

To find the lower and upper boundaries of the mask, you may find this useful: HSV-Threshold-script
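
Whatever tool is used to pick the bounds, it helps to know what they mean: `cv2.inRange` marks a pixel as 255 only when every channel lies within its `[lower, upper]` range, inclusive. A NumPy sketch of that rule, using the bounds from this answer. `in_range` is a hypothetical stand-in, and the two sample pixels are made up for illustration.

```python
import numpy as np


def in_range(hsv, lower, upper):
    """Return 255 where all channels are within [lower, upper], else 0."""
    inside = np.all((hsv >= lower) & (hsv <= upper), axis=-1)
    return inside.astype(np.uint8) * 255


lwr = np.array([0, 102, 0])
upr = np.array([179, 255, 255])

# Two made-up HSV pixels: one saturated (S=200), one washed out (S=50)
hsv = np.array([[[30, 200, 180], [30, 50, 180]]], dtype=np.uint8)
print(in_range(hsv, lwr, upr).tolist())  # [[255, 0]]
```

With these particular bounds, hue and value span their full 8-bit OpenCV ranges, so only the saturation floor of 102 actually filters pixels.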

Upvotes: 2
