Tesseract - unable to recognize Greek letters at all

I am trying to automatically extract a scale (scale bar + a number + unit) from an image. Here is an example:

It is used to map pixels to real world measurement.
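
A minimal sketch of that mapping, assuming the OCR step returns a label such as '200 µm' and the length of the scale bar in pixels is already known. The helper names, the unit table and the regular expression are hypothetical and only illustrate the idea:

```python
import re

# Assumed unit table; extend it if the microscope uses other units.
UNIT_TO_METRES = {"nm": 1e-9, "um": 1e-6, "mm": 1e-3, "m": 1.0}


def parse_scale_label(text):
    """Return (value, unit) from OCR text such as '200 µm', or None if absent."""
    # OCR may emit either the micro sign (U+00B5) or Greek mu (U+03BC);
    # map both to a plain 'u' so one pattern covers them.
    normalised = text.replace("\u00b5", "u").replace("\u03bc", "u")
    match = re.search(r"(\d+(?:\.\d+)?)\s*(nm|um|mm|m)\b", normalised)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def metres_per_pixel(value, unit, bar_length_px):
    """Real-world length represented by one pixel, in metres."""
    return value * UNIT_TO_METRES[unit] / bar_length_px
```

With a parsed label and a measured bar length, metres_per_pixel(value, unit, bar_length_px) gives the factor needed to convert any pixel distance in the image to metres.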

I am using PyTesseract (installed through Anaconda3).

Here is my code:

import cv2
import pytesseract
import numpy as np

img = cv2.imread('pbmk_scale.tif')
#img = cv2.imread('ocr_test_greek_and_english.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening

# Line detection for the scale line
edges = cv2.Canny(gray,50,150,apertureSize = 3)
minLineLength = 100
maxLineGap = 10
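# Pass minLineLength/maxLineGap by keyword: the 5th positional argument of HoughLinesP is the output array `lines`
lines = cv2.HoughLinesP(edges, 1, np.pi/180, 100, minLineLength=minLineLength, maxLineGap=maxLineGap)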
x1,y1,x2,y2 = lines[0][0]
print('Line (' + str(x1) + ',' + str(y1) + ') -- (' + str(x2) + ',' + str(y2) + ')')
# Calculate the length of the scale line in pixels. Since the line is always horizontal, we just need to subtract the X coordinates
l = abs(x1 - x2)
print('Line is ' + str(l) + 'px long')

# Text recognition for the scale number and real unit
# FIXME Greek not detected. Is it grc or ell for the configuration? Both don't work
custom_config = r'-l grc+eng --psm 1' # Greek (for mu and nu letters) and English (for m (metre))
text = pytesseract.image_to_string(img, config=custom_config)
print('OUTPUT:', text.split())
number = [int(s) for s in text.split() if s.isdigit()]
print('Number is ' + str(number))

So far it is working quite nicely, especially since the image is generated through helium ion microscopy and the label (where the scale bar is positioned) is automatically generated and stored along with the image as a TIFF. So detecting text and lines is spot on. In addition, the scale bar is always in the same location in the image and the actual scale line is always horizontal. The code above has its flaws, but I'm more interested in the fact that I am unable to detect anything but English.

Sadly, Anaconda is quite cryptic when it comes to describing the packages it provides (especially if you look at the Navigator). So I did a little digging, and under C:\Users\USER_NAME\anaconda3\envs\MachineLearning\tessdata (with MachineLearning being my custom virtual environment) I found two things:

There are only two .traineddata files - eng.traineddata and osd.traineddata
The eng.traineddata is much smaller (almost 10 times) than the one I found in the git repo of the Tesseract project, hosted on GitHub.
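
This kind of check can also be scripted. A minimal sketch, assuming the tessdata folder is the one named above; the function names are hypothetical and the code only looks at file names, it does not call Tesseract:

```python
from pathlib import Path


def installed_models(tessdata_dir):
    """Return the names of all .traineddata files in tessdata_dir, without extension."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_models(tessdata_dir, wanted):
    """Return the entries of `wanted` that have no .traineddata file in tessdata_dir."""
    have = set(installed_models(tessdata_dir))
    return [name for name in wanted if name not in have]
```

For the folder described here, missing_models(tessdata_dir, ['eng', 'ell', 'grc']) would be expected to report ell and grc, since only eng and osd are present.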

I downloaded multiple trained data files (eng, ell and grc). I did a test with just grc and ell (separately and combined) and a test with both Greek and English in the image. For example the following image (after removing the line detection part of the code above)

yields the following result:

OUTPUT: ['Here’s', 'some', 'GBeek', 'Od10', 'd1ota', 'iumEedit', 'Oy']

I tried various values for the PSM parameter (that made sense of course) but nothing changed.

I am new to OCR and Tesseract so I'm probably missing something quite obvious.

Upvotes: 7

Answers (2)

Antonio Abrantes

Reputation: 591

The line that tells pytesseract where your tesseract executable is located is missing:

pytesseract.pytesseract.tesseract_cmd = 'D:/Tesseract-OCR/tesseract.exe'
custom_config = r'-l grc+eng --psm 1' # Greek (for mu and nu letters) and English (for m (metre))
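
If the downloaded .traineddata files live outside the default folder, the config string can also name that folder through Tesseract's --tessdata-dir option. The helper below is a hypothetical sketch that only assembles the string; it assumes the result is passed as the config argument of pytesseract.image_to_string. Note that in Tesseract's language codes ell is modern Greek and grc is Ancient Greek, so ell is the more likely match for a unit such as µm, although that has not been tested on this image.

```python
def build_tesseract_config(languages, psm, tessdata_dir=None):
    """Assemble a config string such as '-l ell+eng --psm 6' (hypothetical helper)."""
    parts = ["-l " + "+".join(languages), "--psm " + str(psm)]
    if tessdata_dir is not None:
        # Forward slashes sidestep backslash-escaping surprises on Windows paths.
        tessdata_dir = str(tessdata_dir).replace("\\", "/")
        # Quote the path so that spaces (e.g. 'Program Files') survive argument splitting.
        parts.insert(0, f'--tessdata-dir "{tessdata_dir}"')
    return " ".join(parts)


# Example: the path is an assumption, adjust it to your installation.
custom_config = build_tesseract_config(
    ["ell", "eng"], 6, r"C:\Program Files\Tesseract-OCR\tessdata"
)
```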

Upvotes: 1

Cpt_Nemos

Reputation: 11

Try using PNG images with the following code:

from PIL import Image
import pytesseract

img_path = r'your image path'
# Tesseract expects the tessdata location as a --tessdata-dir option, not as a bare path
tessdata_dir_config = r'--tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata"'
language = 'grc'

def process_image(image_name, lang_code, tessdata_dir_config):
    # Run OCR on the image with the given language and tessdata directory
    return pytesseract.image_to_string(Image.open(image_name), lang=lang_code, config=tessdata_dir_config)

def print_data(data):
    print(data)

def output_file(filename, data):
    # Write as UTF-8 so Greek characters are preserved regardless of the system default encoding
    with open(filename, "w", encoding="utf-8") as file:
        file.write(data)

def main():
    data_gr = process_image(img_path, language, tessdata_dir_config)
    print_data(data_gr)
    #output_file('my_ocr', data_gr)

if __name__ == '__main__':
    main()

Upvotes: 0
