Teja Nandamuri

Reputation: 11201

How to tell tesseract to not ignore blank spaces between words?

I'm trying to implement a business card scanning app using the Tesseract library.

I read articles on improving Tesseract performance and tried a few of the suggestions, pre-processing the image before passing it to Tesseract.

I found that Tesseract works best with grayscale or black-and-white images.

I'm having trouble choosing the right page segmentation mode.

So far,

G8PageSegmentationModeSingleBlock (Assume a single uniform block of text)

has given me the best results for the business card format.

Here are the results using this segmentation mode:

GrayScale:


With the grayscale image, Tesseract recognises the words (see the red rectangle), but it sometimes fails to recognise the space between words.

Here is the output:

o
f l ,t!ti,iy,,,tyii,i,,!),i),,m,i,st,,,i,t,)) ',
REAL E:ESrry"irfEf
SOLUTIONS WC, n
TimTsai        ----> (space missing here)
Investor & Consultant
p 780.803.9935
f 888.803.1485
e [email protected]
w www.lnnoventionGroup.ca

Black & White :


This is a little better than grayscale in terms of identifying the space between words, but it also reads the borders of the image as letters and appends them to the actual text. (Note how the red rectangle extends to the edge of the image, since the segmentation mode is set to identify a uniform block of text.)

Here is the output:

o,
f I t,!h,tig/i,i,,ip,,ip,iy (,
REAL ESTATE i,
SOLUTIONS INC. (i,
Tim Tsai i;,      ------> (yay, got the space)
Investor & Consultant ii,
p 780.803.9935 :i,
f 888.803.1485 i:,
e [email protected] (i,
,
-ee_--e_-----e----------ir-eeeereree-e-re---------------, u p

I also tried removing the border. This time, the border noise was gone, but the space between the words was lost again.


Here is the output:

 o
I I !,,!ih,tle/IiEhp,tt,l,l),!
REAL ESTATE
SOLUTIONS INC.
TimTsai
Investor & Consultant
p 780.803.9935
f 888.803.1485
e [email protected]

Question:

What is the reason for this behaviour (ignoring spaces between words)?

How can I improve this so that Tesseract does not drop the blank spaces?

I could also look at rotation/deskewing, but I'm not sure how much that would help in this case, as the text looks mostly horizontal to me.

Code:

G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"];
tesseract.delegate = self;
tesseract.engineMode = G8OCREngineModeTesseractCubeCombined;

// Optional: Limit the character set Tesseract should try to recognize from
tesseract.charWhitelist = @"@.,&():ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 ";
tesseract.charBlacklist = @"$%^*={};<>\\~`";

// Specify the image Tesseract should recognize on
tesseract.image = [img g8_blackAndWhite];
tesseract.sourceResolution = kG8MaxCredibleResolution;

// Optional: Limit the area of the image Tesseract should recognize on to a rectangle
CGRect tessRect = CGRectMake(0, 0, tesseract.image.size.width, tesseract.image.size.height);
tesseract.rect = tessRect;

// Optional: Limit recognition time with a few seconds
tesseract.maximumRecognitionTime = 60.0;

// Start the recognition
[tesseract recognize];

// Retrieve the recognized text
NSLog(@"text %@", [tesseract recognizedText]);

Upvotes: 3

Views: 9863

Answers (1)

try-catch-finally

Reputation: 7634

Set preserve_interword_spaces to true to preserve multiple spaces between words.

Your code might look like this:

tesseract.setVariable("preserve_interword_spaces", "1");
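The call above is written in a generic form. With the G8Tesseract wrapper used in the question, the equivalent would most likely go through its setVariableValue:forKey: method. The following is only a sketch under some assumptions: RecognizeCardText is a hypothetical helper name, the umbrella header path is the one commonly used with Tesseract-OCR-iOS, and the Tesseract core bundled with your copy of the framework has to be recent enough to know the preserve_interword_spaces parameter (older cores may simply ignore it).

```objc
#import <UIKit/UIKit.h>
#import <TesseractOCR/TesseractOCR.h>

// Hypothetical helper: runs OCR on a business card image and asks Tesseract
// to keep the spacing between words instead of collapsing it.
static NSString *RecognizeCardText(UIImage *cardImage) {
    G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"];

    // Same segmentation mode the question reports as working best.
    tesseract.pageSegmentationMode = G8PageSegmentationModeSingleBlock;

    // Assumption: the bundled Tesseract core recognises this parameter name.
    [tesseract setVariableValue:@"1" forKey:@"preserve_interword_spaces"];

    // Pre-process and recognise, as in the question's code.
    tesseract.image = [cardImage g8_blackAndWhite];
    [tesseract recognize];

    return tesseract.recognizedText;
}
```

The variable is set before recognize is called, so it is in place when recognition runs. If the output still shows "TimTsai", it is worth checking which Tesseract version the framework was built against before looking at further pre-processing.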

For the command line interface use the -c switch this way:

tesseract image.jpg output -c preserve_interword_spaces=1

(Voluntary answer from helpful comments; credits to user nguyenq)

Upvotes: 1
