How to tell tesseract to not ignore blank spaces between words?

Question

I'm trying to implement a business card scanning app using the Tesseract library.

I read articles on improving Tesseract's accuracy and tried a few of the suggestions, preprocessing the image before passing it to Tesseract.

I found that Tesseract works best with grayscale or black-and-white images.

I'm having trouble choosing the right page segmentation mode.

So far, G8PageSegmentationModeSingleBlock (assume a single uniform block of text) has given me the best results for the business card format.

Here are the results using this segmentation mode:

Grayscale:

When using a grayscale image, Tesseract recognises the words (look at the red rectangle), but sometimes it fails to recognise the space between words.

Here is the output:

o
f l ,t!ti,iy,,,tyii,i,,!),i),,m,i,st,,,i,t,)) ',
REAL E:ESrry"irfEf
SOLUTIONS WC, n
TimTsai        ----> (space missing here)
Investor & Consultant
p 780.803.9935
f 888.803.1485
e tim@lnnoventionGroup.ca
w www.lnnoventionGroup.ca

Black & White:

This is a little better than grayscale in terms of identifying the space between words, but it also recognises the borders of the image as letters and appends them to the actual text. (See how the red rectangle extends to the edge of the image, since the segmentation mode is set to identify a uniform block of text.)

Here is the output:

o,
f I t,!h,tig/i,i,,ip,,ip,iy (,
REAL ESTATE i,
SOLUTIONS INC. (i,
Tim Tsai i;,      ------> (yay, got the space)
Investor & Consultant ii,
p 780.803.9935 :i,
f 888.803.1485 i:,
e tim@lnnoventionGroup.ca (i,
,
-ee_--e_-----e----------ir-eeeereree-e-re---------------, u p

I also tried removing the border, and this time, it didn't read the blank space between the words.

output:

 o
I I !,,!ih,tle/IiEhp,tt,l,l),!
REAL ESTATE
SOLUTIONS INC.
TimTsai
Investor & Consultant
p 780.803.9935
f 888.803.1485
e tim@lnnoventionGroup.ca

Question:

What is the reason for this behaviour (ignoring spaces between words)?

How can I improve this so that Tesseract does not ignore the blank spaces?

I can also look at rotation/deskewing, but I'm not sure how much that would improve the results in this case, as the text looks mostly horizontal to me.

Code:

G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"];
tesseract.delegate = self;
tesseract.engineMode = G8OCREngineModeTesseractCubeCombined;

// Optional: limit the character set Tesseract should try to recognize
tesseract.charWhitelist = @"@.,&():ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 ";

// Optional: characters Tesseract should never output
// (the backslash is escaped so that it is a valid string literal)
tesseract.charBlacklist = @"$%^*={};<>\\~`";

// Specify the image Tesseract should recognize
tesseract.image = [img g8_blackAndWhite];

tesseract.sourceResolution = kG8MaxCredibleResolution;

// Optional: limit the area of the image Tesseract should recognize to a rectangle
// (here, the whole image)
CGRect tessRect = CGRectMake(0, 0, tesseract.image.size.width, tesseract.image.size.height);
tesseract.rect = tessRect;

// Optional: limit the recognition time to 60 seconds
tesseract.maximumRecognitionTime = 60.0;

// Start the recognition
[tesseract recognize];

// Retrieve the recognized text
NSLog(@"text %@", [tesseract recognizedText]);
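
The code above does not show where the page segmentation mode is set or how the border was excluded. A minimal sketch of both steps, assuming the `pageSegmentationMode` and `rect` properties of `G8Tesseract` from Tesseract-OCR-iOS; the helper name `ConfigureForBusinessCard` and the `borderInset` parameter are hypothetical and only for illustration:

```objc
#import <UIKit/UIKit.h>
#import <TesseractOCR/TesseractOCR.h>

// Hypothetical helper: selects single-block segmentation and shrinks the
// recognition rectangle so that the card border is left out.
// Assumes tesseract.image has already been assigned.
static void ConfigureForBusinessCard(G8Tesseract *tesseract, CGFloat borderInset) {
    // Treat the card as a single uniform block of text.
    tesseract.pageSegmentationMode = G8PageSegmentationModeSingleBlock;

    // Start from the full image and inset every side by borderInset
    // (in the same units as the image size).
    CGSize size = tesseract.image.size;
    CGRect fullRect = CGRectMake(0, 0, size.width, size.height);
    tesseract.rect = CGRectInset(fullRect, borderInset, borderInset);
}
```

It would be called after `tesseract.image` is set and before `[tesseract recognize]`, for example `ConfigureForBusinessCard(tesseract, 10.0);`, with the inset tuned to the actual border width of the scanned card.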
