Reputation: 43
I'm trying to use Tesseract to read an image, but it returns gibberish. I know I need to do some pre-processing, but what I have found online doesn't seem to work with my image. I tried the approach from another answer to invert the picture from white letters on a black background to black letters on a white background, without success.
This is the picture.
And my simple code:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'D:\Tesseract-OCR\tesseract'
img = Image.open("2020-01-25_17-57-49_UTC.jpg")
print(pytesseract.image_to_string(img))
Upvotes: 1
Views: 1140
Reputation:
Cobbling together code found here on SO:
from PIL import Image
import PIL.ImageOps
import pytesseract
img = Image.open("8pjs0.jpg")
inverted_image = PIL.ImageOps.invert(img)
print(pytesseract.image_to_string(inverted_image))
gives me
Dolar Hoy en Cucuta
25-Enero-20
01:00PM
78.048
VENTA
I think you'll need a language pack for the accented characters.
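As a rough sketch of the language-pack idea: pytesseract's image_to_string accepts a lang argument, and Tesseract ships Spanish trained data under the code spa. The helper below, pick_ocr_language, is a hypothetical name made up for this illustration; it only chooses a language code from a list of installed ones. It assumes the text in the picture is Spanish, and I haven't verified that the Spanish model actually recovers the accents on this image.

```python
def pick_ocr_language(installed, preferred="spa", fallback="eng"):
    """Return `preferred` if its trained data is installed, else `fallback`.

    `installed` is a list of Tesseract language codes, such as the one
    returned by pytesseract.get_languages().
    """
    return preferred if preferred in installed else fallback


# Hypothetical usage (needs Tesseract and pytesseract installed, so it is
# left as a comment here):
#   import pytesseract
#   lang = pick_ocr_language(pytesseract.get_languages())
#   print(pytesseract.image_to_string(inverted_image, lang=lang))
```

If spa is not installed, this falls back to English rather than letting Tesseract fail on a missing language file.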
Upvotes: 2
Reputation: 46670
A simple Otsu's threshold to obtain a binary image, followed by an inversion to get the letters in black and the background in white, seems to work. We use --psm 3 to tell Tesseract to perform fully automatic page segmentation (this is also its default mode). Take a look at Pytesseract OCR multiple config options for more configuration options. Here's the preprocessed image
Result from Pytesseract OCR
Dolar Hoy en Cucuta
25-Enero-20
01:00PM
78.048
VENTA
Code
import cv2
import numpy as np
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Load image, grayscale, threshold, invert
image = cv2.imread('1.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
result = 255 - thresh
# Perform OCR with Pytesseract
data = pytesseract.image_to_string(result, config='--psm 3')
print(data)
cv2.imshow('result', result)
cv2.waitKey()
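For anyone curious what the Otsu step computes, here is a minimal dependency-free sketch. It is an illustration only, not OpenCV's implementation: it picks the gray level that maximizes the between-class variance of the histogram, then thresholds and inverts a flat list of pixel values. The function names otsu_threshold and binarize_and_invert are made up for this example. In practice, stick with cv2.threshold as in the code above.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    `pixels` is an iterable of integer gray levels in 0..255.
    Pixels <= t form the dark class, pixels > t the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # pixel count of the dark class so far
    sum0 = 0  # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no split to evaluate
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize_and_invert(pixels, t):
    """Threshold at t (bright -> 255, dark -> 0), then invert the result."""
    return [255 - (255 if p > t else 0) for p in pixels]


# Toy data: dark background values mixed with bright "letter" values
sample = [10, 12, 15, 11] * 10 + [240, 250, 245] * 5
t = otsu_threshold(sample)
inverted = binarize_and_invert(sample, t)
```

After this, the bright letter pixels end up as 0 (black) and the dark background as 255 (white), which is the form Tesseract tends to handle best.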
Upvotes: 1