Reputation: 77
I am trying to detect the text in an image (a JPG file) using Tesseract-OCR and OpenCV in Python. The text in the image is Turkish, so I am using the Turkish trained data (tur) that ships with Tesseract-OCR. I have applied dilation and erosion to remove noise before running Tesseract.
The problem is that, even though some characters in particular areas are detected, the detection is mostly unsuccessful and fails to recognize Turkish characters. Do you know of any method, or do you have any suggestions, for getting better results? Here is my code:
import cv2
import pytesseract
from PIL import Image

# Point pytesseract at the Tesseract executable.
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

# Raw string so the backslashes in the Windows path are not read as escape sequences.
img = cv2.imread(r'C:\Users\gulsa\Desktop\Tesseract-OCR\alm98_2.jpg')

# Run OCR with the Turkish language data.
tex = pytesseract.image_to_string(Image.open('alm98_2.jpg'), lang='tur')
print(tex)
Thank you in advance!
Upvotes: 1
Views: 3279
Reputation: 2357
Here's what I get after running Tesseract on your image:
HerTürdenErutikyıdeplç'nTıkla!Sımsıkainlemereoyo AnındaCebirıdenIde!Iziemeklçin18YaşındanBüyükoin'ak Zorunludur.HerkamgoridenyüzleroevideoHighDefTvde!High DefTv,abonelik"servistir.Pakelhaîlaliktümvergilerdahilolamk ayda64TLyebtaIedimedig'süreoeherz—ıyyenileneoekîir.Servis ücreti,aboneoldugınuzoperaîöfündüzenleyecegifaîuralar karaliylaveyaönödemelihatlardanTL/Krmikîaridüsülerekîahsil edilecektir.Ipîaliğn:|PTALya24329z-ıgörder.Iptaledilendönem içinücretiadasiyapiin'azXeteriibakiyenizyokayükleme
So far this doesn't seem like a very bad result. I'm not saying it's a very good one, but the errors have nothing to do with Turkish letters. You can get much better results if you can detect and separate the letters that are currently too close to each other.
For example, for this image I get perfect results (notice the better font and the extra space between characters):
Her Türden Erotik Video Için Tıkla!Sımsicak Binlerce Videoyu
If you're getting a lot of noisy characters that are definitely not in the Turkish alphabet (like the fl or î symbols), you can make a blacklist.
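A minimal sketch of the blacklist idea, assuming pytesseract passes its `config` string through to Tesseract and that the `tessedit_char_blacklist` variable is honored by your Tesseract version and engine mode (some versions reportedly ignore it with the LSTM engine). The helper names `build_blacklist_config` and `ocr_with_blacklist` are made up for illustration, and the blacklisted characters are only the two mentioned above:

```python
def build_blacklist_config(chars):
    """Build a Tesseract config string that blacklists the given characters."""
    # Drop whitespace and duplicates while keeping the original order.
    unique = "".join(dict.fromkeys(c for c in chars if not c.isspace()))
    return "-c tessedit_char_blacklist=" + unique


def ocr_with_blacklist(image_path, chars, lang="tur"):
    """Run OCR on image_path while asking Tesseract to avoid `chars`."""
    # Imported here so the config helper above works without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang=lang,
        config=build_blacklist_config(chars),
    )


if __name__ == "__main__":
    # Characters taken from the noisy output above; extend as needed.
    print(build_blacklist_config("îﬂ"))
```

You would then call something like `ocr_with_blacklist('alm98_2.jpg', 'îﬂ')` and compare the output against the plain `lang='tur'` run.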
Another option is to iterate through the Tesseract results character by character and correct them, if you have a heuristic for that.
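A rough sketch of such a character-by-character pass. The correction table is an assumption inferred from the sample output above, where î seems to show up where a t is expected (for example "faîuralar" and "operaîör"); the name `correct_ocr_text` is hypothetical, and a real table would have to be tuned to your own images:

```python
# Assumed corrections: OCR character -> most likely intended character.
CORRECTIONS = {
    "î": "t",  # guessed from "faîuralar" / "operaîör" in the sample output
}


def correct_ocr_text(text, corrections=CORRECTIONS):
    """Replace each character that has a known correction; keep the rest."""
    fixed = []
    for ch in text:
        fixed.append(corrections.get(ch, ch))
    return "".join(fixed)


if __name__ == "__main__":
    print(correct_ocr_text("faîuralar"))
```

A plain lookup like this cannot use context, so it will also rewrite any place where the character was actually correct. A dictionary check on the resulting words would be a natural next step.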
Edit: TBH, when I try to read the text in your image I cannot separate the words in the sentence. Maybe it is specific to the font you're using, but it definitely looks too harsh for both human and machine.
Edit2: Added an example image with more space between chars.
Upvotes: 1