nnm

Reputation: 1425

Tesseract not able to recognize characters even for a high quality Image

I am cleaning up an image using Leptonica and then passing it to Tesseract for OCR. However, Tesseract is not able to recognize the characters even though the image is of high quality. The image specifications are as follows:

1 bpp, uncompressed, 1280 x 960, 300 dpi horizontal and vertical resolution

These are the image processing operations I carry out, in sequence, using Leptonica:

pixConvertTo8
pixBackgroundNormSimple
pixOtsuAdaptiveThreshold
pixContrastTRC {I pass high values such as 1.0 or even 5.0, but the image doesn't really change}
pixFindSkew
pixRotate { rotate by angle found by pixFindSkew}
pixRotate90 {do this 4 times to read image in all 4 orientations}
pixClipRectangle {crop image}
Finally, the tesseract command

I get garbage characters in the output. A sample input image is as follows: [image]

The output that I get is as follows:

Final K-1
II]
s h d | K-1 ,.,
(F°o.~?n‘i&1) 5/>.©12 mm E2‘;
Deparlrnenl of tho Treasury , ,
I 1 I l I
‘mama, Ravenuo SGMW For cnlundm your 201), ‘ " °F°$ "'100fTIO
or lax yum boqmnnnq 7 _ 20\Q_
‘ 7660
and ondmg _  W vv I go
Beneﬁciary's Share of Income, Deductions,
cl'editS, etc. F 800 buck 01 loam nnd lnstruoflons»
___lnformatI0n About mo Estate or Trust
‘ Ordmary d|v|dm
i 12113
 _
‘; Quahfmd dlVIdG
\ 8132
3 1
Net shun-term
A Estate's at trust's omgiuym ldonnlmnluon numbol
56-0987654
B Estate's u trust‘: namo
ESTATE OF MARTHA SMITH
0 Fiduc§ary's name, address, clly, smlu‘ and /IP codo
N01 long~lerm c
\ 24043 
u 
‘ 28% vale gann
Ti
Unreptumd 5
Omar porfloho 4
nonbuslness lfll
/\..4........ L. ._.._ ,.

What should I do to improve the accuracy?

Part 2:

I tried to follow this link. I created an eng.user-words.traineddata file and a bazaar.train file, and tried to run Tesseract with "bazaar" as an additional parameter, but I get the error "read_params_file: can't open bazaar". Any suggestions?

Upvotes: 0

Views: 1665

Answers (2)

BigSte

Reputation: 11

For part one,

I don't know if the image you posted here is the actual one you've been trying to scan, but when I tried it, I got this:


Department oi the Treasury Internal Revenue Service

For cnlundm your V019, 1 ‘ '"l0T°5' |nC0m0

or tax yam boqlnnlnq , 2o12_ ‘ 7660 and ondlng I go 2: ‘ Ordinary dlvndm " “T ' x 12113

1; Quali?ed dwnda ‘ 8132 Netshun-term:

M Not long ~terrn c

i 24043 Ab ‘ 2896 ralagann

Bene?ciary’s Share of Income, Deductions, Cfedits, etc. 5 800 back oi form nnd Instruc?ons

| Partl Information About the state or Trust

A Estate's or IvLsl's omuoym Idonnlncnluon numhu

56-0987654

8 Estate‘: a trust‘: namo

ESTATE OF MARTHA SMITH

M: Unreptumd 5

017161 portioho : nonbuslness Inl

C Fiduc§ary's name, address, city, smlul an-(V1/If’ Eooo


It's not great but it seems a bit better than what you got. I'm using Tesseract v3 on Windows. My basic command was:

-    tesseract.exe  nnm.tif  nnm

For part two,

Your bazaar file should be in the configs folder:

 .....\Tesseract-OCR\tessdata\configs\bazaar

There are also requirements for the file format: it should be saved as UTF-8 with only an LF at the end of each line, not CR + LF. Tesseract seems to be quite fussy about file formats.

You can get a copy of it from http://code.metager.de/source/raw/google/tesseract-ocr/tessdata/configs/bazaar
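
If the file was edited on Windows, the line endings can be fixed with a small script. This is only a minimal sketch in Python, not something Tesseract ships with: the function names are made up for illustration, and stripping a UTF-8 byte order mark is an extra assumption on my part (some Windows editors add one).

```python
from pathlib import Path


def normalize_config_bytes(raw: bytes) -> bytes:
    """Return the content as UTF-8 with LF-only line endings."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present
    # (assumption: a BOM is unwanted in a Tesseract config file).
    text = raw.decode("utf-8-sig")
    # Convert CR + LF (and any stray CR) to a plain LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Make sure the last line is terminated as well.
    if text and not text.endswith("\n"):
        text += "\n"
    return text.encode("utf-8")


def normalize_config_file(path: str) -> None:
    """Rewrite the file at `path` in place with normalized bytes."""
    config = Path(path)
    # read_bytes/write_bytes avoid any newline translation by Python.
    config.write_bytes(normalize_config_bytes(config.read_bytes()))
```

Calling `normalize_config_file` on the bazaar file inside tessdata\configs rewrites it in place, so keep a backup copy first.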

I made a digits config file that I used for scanning some images where I was only interested in the numbers and that worked fine:

-   tesseract.exe  scanfile.jpg  scanfile  digits 
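
For reference, here is roughly how such a config file can be created and used from a script. Treat it as a sketch under assumptions: the whitelist line is what I would expect a digits-only config to contain (`tessedit_char_whitelist` is a Tesseract parameter), the helper names are hypothetical, and the executable is assumed to be on the PATH as `tesseract`.

```python
import subprocess
from pathlib import Path

# Assumed content of a digits-only config: one "name value" pair per line.
DIGITS_CONFIG = "tessedit_char_whitelist 0123456789\n"


def write_config(configs_dir, name="digits", text=DIGITS_CONFIG):
    """Write a config file as UTF-8 with LF line endings and return its path."""
    path = Path(configs_dir) / name
    # write_bytes keeps "\n" as LF, even on Windows.
    path.write_bytes(text.encode("utf-8"))
    return path


def build_command(image, output_base, *configs, exe="tesseract"):
    """Build the argument list: tesseract <image> <output base> [configs...]."""
    return [exe, image, output_base, *configs]


def run_ocr(image, output_base, *configs):
    """Run Tesseract; the recognized text goes to <output_base>.txt."""
    subprocess.run(build_command(image, output_base, *configs), check=True)
```

With the config saved in tessdata\configs, `run_ocr("scanfile.jpg", "scanfile", "digits")` corresponds to the command line above, and passing "bazaar" instead of "digits" would be the equivalent call for part two.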

The documentation for Tesseract is pretty poor and it doesn't work well on a PC.

Upvotes: 1

Novice

Reputation: 535

For part one,

I think you should look at the preprocessing done by Capture2Text. It uses both Leptonica and Tesseract to OCR images.

I am not sure about part 2.

Upvotes: 0
