Reputation: 550
I am quite new to Image Processing, CV and OCR. So far I think it's an amazing subject and I am willing to dig into it further.
Imagine I have this original image:
I resize it to this:
Then I found the regional maxima and got this image (to avoid lighter backgrounds and overly noisy ones):
Then I thresholded the image above and, after some processing, got this:
This image does not look 100% binary to me... if I zoom in, some gray pixels show up inside the characters...
I was thinking that this last image should be good enough (very good, indeed) for OCR, don't you think? But no text is coming out of it...
My code:
#http://stackoverflow.com/questions/18813300/finding-the-coordinates-of-maxima-in-an-image
from PIL import Image
import numpy as np
from skimage import io
from skimage import img_as_float
from scipy.ndimage import gaussian_filter
from skimage.morphology import reconstruction
import pytesseract
im111 = Image.open('page.jpg')
basewidth = 1000
wpercent = (basewidth / float(im111.size[0]))
hsize = int((float(im111.size[1]) * float(wpercent)))
image_resized = im111.resize((basewidth, hsize), Image.ANTIALIAS)
image_resized.save('page2.jpg')
image = img_as_float(io.imread('page2.jpg', as_grey=True))
image = gaussian_filter(image, 1)
seed = np.copy(image)
seed[1:-1, 1:-1] = image.min()
mask = image
dilated = reconstruction(seed, mask, method='dilation')
image = image - dilated
#print type(image)
#io.imsave("RegionalMaxima.jpg", image)
im = np.array(image * 255, dtype=np.uint8)
img = Image.fromarray(im)
#img.show()
#print type(img)
#img.save('RegionalMaximaPIL.jpg')
#image2 = Image.open('RegionalMaxima.jpg')
minima, maxima = img.getextrema()
print "------Extrema1----------" + str(minima), str(maxima)
mean = int(maxima/4)
im1 = img.point(lambda x: 0 if x<mean else maxima, '1')
im1.save('Thresh_calculated.jpg')
#im1.show()
mini, maxi = im1.getextrema()
print "-------Extrema2(after1stTresh)---------" + str(mini), str(maxi)
im2 = im1.point(lambda x: 0 if x<128 else 255, '1')
im2.save('Thresh_calculated+++.jpg')
im2.show()
text = pytesseract.image_to_string(im2)
print "-----TEXT------" + text
What am I doing wrong? pytesseract.image_to_string(im1) with the thresholded image should already retrieve some text :/
Another doubt: shouldn't the second getextrema() return 0 and 255? I am confused, since it still reports the same numbers as before the first threshold... and the image resulting from the second threshold is all black.
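To show what I expect, here is a minimal check of what point() plus getextrema() report on a tiny synthetic image (the pixel values are made up, just to have values on both sides of the threshold):

```python
from PIL import Image

# Tiny synthetic grayscale image with values below and above 128
im = Image.new('L', (2, 2))
im.putdata([10, 60, 150, 240])

# Same threshold-to-mode-'1' pattern as in my code above
bw = im.point(lambda x: 0 if x < 128 else 255, '1')
print(bw.getextrema())
```

Pillow reports mode '1' pixels as 0 and 255, so on an image containing both black and white this prints (0, 255), which is what I expected to see after the first threshold too.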
Thanks so much for your time and help.
Upvotes: 3
Views: 2885
Reputation: 2106
I found that sometimes it has problems with JPEGs but works fine on a PNG of the same image. So I convert the file to PNG and read that instead.
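A minimal way to do that conversion with Pillow (the file names are placeholders, and the blank stand-in image is only there so the snippet runs on its own):

```python
from PIL import Image

# Stand-in JPEG so the snippet is self-contained; use your real file instead
Image.new('RGB', (50, 50), 'white').save('page.jpg')

# Re-save the JPEG as a PNG before handing it to pytesseract
Image.open('page.jpg').save('page.png')
print(Image.open('page.png').format)
```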
Upvotes: 0
Reputation: 207758
Sorry, I don't speak Python, but I have some experience with tesseract from the command line. From some experiments I did a while back, I think the sweet spot for tesseract recognising letters is when they are around 30-50 pixels tall.
Following that logic, I extracted a portion of your image with ImageMagick to encompass the words Nokia and 225. I then resized the resulting two rows of text, plus a bit of vertical space, to 160 pixels, i.e. to make the letters around 50 pixels high.
convert nokia.jpg -crop 1000x800+1800+1000 -resize x160 x.jpg
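If you want the equivalent in Python, that crop-and-resize is roughly the following with Pillow (the geometry mirrors the command above; the blank stand-in image is only there so the snippet runs without your nokia.jpg):

```python
from PIL import Image

# Stand-in for nokia.jpg; the real dimensions are assumed here
im = Image.new('L', (4000, 3000), 255)

# Mirror "-crop 1000x800+1800+1000": width x height + x-offset + y-offset
crop = im.crop((1800, 1000, 1800 + 1000, 1000 + 800))

# Mirror "-resize x160": fix the height at 160 px, keep the aspect ratio
scale = 160 / crop.height
small = crop.resize((round(crop.width * scale), 160))
print(small.size)  # (200, 160)
small.save('x.jpg')
```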
Then I ran tesseract as follows and looked at the recognised text:
tesseract x.jpg text
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
more text*
NOKIA
225
I am not pretending this is the miracle solution - I am just saying I would chop out some text - maybe using "Connected Component Analysis" (or something else) - resize it so that the text is around 30-80 pixels high, and see how that goes.
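A connected-component pass could be sketched with scipy.ndimage, which your code already pulls in (the toy array below stands in for a thresholded page; the sizes are made up):

```python
import numpy as np
from scipy.ndimage import label, find_objects

# Toy binary image with two separate blobs, standing in for a
# thresholded page where each blob would be a character or word
img = np.zeros((10, 10), dtype=np.uint8)
img[1:4, 1:4] = 1
img[6:9, 5:9] = 1

labeled, n = label(img)
print(n)  # 2

# Bounding box of each component; a real pipeline would crop these
# regions out and resize them to ~30-80 px tall before OCR
for sl in find_objects(labeled):
    print(sl[0].stop - sl[0].start, sl[1].stop - sl[1].start)
```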
Feel free to ask any questions in the comments and I will see what I can do - or maybe some other clever folk will know more and chip in their thoughts...
I had a moment to do some more experimenting, so I tried to find the sweet spot for the overall height of your image to help tesseract be more successful. I varied the height from 100 to 500 pixels in steps of 10 and then looked at the resulting OCR'ed text like this:
for x in $(seq 100 10 500); do
convert nokia.jpg -resize x$x small.jpg
echo Height:$x
tesseract small.jpg text >/dev/null 2>&1 && grep -E "NOKIA|225" text*
done
Height:100
Height:110
Height:120
Height:130
Height:140
NOKIA
225
Height:150
Height:160
Height:170
225
Height:180
Height:190
NOKIA
225
Height:200
225
Height:210
NOKIA
225
Height:220
NOKIA
225
Height:230
NOKIA
225
Height:240
NOKIA
225
Height:250
NOKIA
225
Height:260
Height:270
NOKIA
225
Height:280
Height:290
NOKIA
225
Height:300
NOKIA
225
Height:310
Height:320
NOKIA
225
Height:330
Height:340
Height:350
NOKIA
225
Height:360
NOKIA
225
Height:370
NOKIA
225
Height:380
NOKIA
225
Height:390
Height:400
Height:410
Height:420
Height:430
Height:440
Height:450
NOKIA
225
Height:460
Height:470
Height:480
Height:490
Height:500
Upvotes: 7