Reputation: 11201
I'm trying to implement a business card scanning app using the Tesseract library.
I read articles on improving Tesseract's accuracy and tried a few of the suggestions, preprocessing the image before passing it to Tesseract.
I found that Tesseract works best with grayscale or black-and-white images.
I'm having trouble choosing the right page segmentation mode.
So far, G8PageSegmentationModeSingleBlock (assume a single uniform block of text) has given me the best results for the business card format.
Here are the results using this segmentation mode:
Grayscale:
With a grayscale image, Tesseract recognises the words (look at the red rectangle), but it sometimes fails to recognise the space between words.
Here is the output:
o
f l ,t!ti,iy,,,tyii,i,,!),i),,m,i,st,,,i,t,)) ',
REAL E:ESrry"irfEf
SOLUTIONS WC, n
TimTsai ----> (space missing here)
Investor & Consultant
p 780.803.9935
f 888.803.1485
e [email protected]
w www.lnnoventionGroup.ca
Black & White:
This is a little better than grayscale in terms of identifying the space between words, but it also recognizes the borders of the image as letters and appends them to the actual text. (See how the red rectangle extends to the edge of the image, since the segmentation mode is set to identify a uniform block of text.)
Here is the output:
o,
f I t,!h,tig/i,i,,ip,,ip,iy (,
REAL ESTATE i,
SOLUTIONS INC. (i,
Tim Tsai i;, ------> (yay, got the space)
Investor & Consultant ii,
p 780.803.9935 :i,
f 888.803.1485 i:,
e [email protected] (i,
,
-ee_--e_-----e----------ir-eeeereree-e-re---------------, u p
I also tried removing the border, and this time it again failed to read the blank space between the words.
output:
o
I I !,,!ih,tle/IiEhp,tt,l,l),!
REAL ESTATE
SOLUTIONS INC.
TimTsai
Investor & Consultant
p 780.803.9935
f 888.803.1485
e [email protected]
Question:
What is the reason for this behaviour (ignoring spaces between words)?
How can I improve this so that Tesseract does not ignore the blank spaces?
I could also look at rotation/deskewing, but I'm not sure how much that would help in this case, as the text looks mostly horizontal to me.
Code:
G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"];
tesseract.delegate = self;
tesseract.engineMode = G8OCREngineModeTesseractCubeCombined;
// Optional: Limit the character set Tesseract should try to recognize from
tesseract.charWhitelist = @"@.,&():ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 ";
tesseract.charBlacklist = @"$%^*={};<>\\~`";
// Specify the image Tesseract should recognize on
tesseract.image = [img g8_blackAndWhite];
tesseract.sourceResolution = kG8MaxCredibleResolution;
// Optional: Limit the area of the image Tesseract should recognize on to a rectangle
CGRect tessRect = CGRectMake(0, 0, tesseract.image.size.width, tesseract.image.size.height);
tesseract.rect = tessRect;
// Optional: Limit recognition time with a few seconds
tesseract.maximumRecognitionTime = 60.0;
// Start the recognition
[tesseract recognize];
// Retrieve the recognized text
NSLog(@"text %@", [tesseract recognizedText]);
Upvotes: 3
Views: 9863
Reputation: 7634
Set preserve_interword_spaces to true to preserve multiple spaces between words.
Your code might look like this:
tesseract.setVariable("preserve_interword_spaces", "1");
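With the G8Tesseract wrapper used in the question, the same setting could probably be applied through setVariableValue:forKey:. The following is only a minimal sketch: RecognizeCardText is a hypothetical helper name, and it assumes that the Tesseract build bundled with your version of the wrapper actually knows the preserve_interword_spaces variable (older builds may not).

```objc
#import <UIKit/UIKit.h>
#import <TesseractOCR/TesseractOCR.h>

// Hypothetical helper (not part of the library): runs OCR on a card image
// with the segmentation mode from the question and the suggested variable.
static NSString *RecognizeCardText(UIImage *img) {
    G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng"];

    // Segmentation mode that gave the best results in the question.
    tesseract.pageSegmentationMode = G8PageSegmentationModeSingleBlock;

    // Assumption: the bundled Tesseract version supports this variable.
    [tesseract setVariableValue:@"1" forKey:@"preserve_interword_spaces"];

    // Same preprocessing as in the question's code.
    tesseract.image = [img g8_blackAndWhite];

    [tesseract recognize];
    return tesseract.recognizedText;
}
```

If the output is unchanged, the variable is most likely not supported by that build, so it is worth comparing against the command-line tool on the same image.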
For the command line interface, use the -c switch this way:
tesseract image.jpg output -c preserve_interword_spaces=1
(Voluntary answer from helpful comments; credits to user nguyenq)
Upvotes: 1