Reputation: 1425
I am doing the process of cleaning up and image using leptonica and then passing it to tesseract for OCR.However it is not able to recognize the characters even though the image is of high quality.The image specifications are as follows.
1 bpp, uncompressed, 1280 * 960 , 300dpi horizontal and vertical resolution
Following are the image processing operations I carry out in sequence using leptonica
pixConvertTo8
pixBackgroundNormSimple
pixOtsuAdaptiveThreshold
pixContrastTRC {Regarding this - I am passing high values like 1.0 or even 5.0 but image doesnt really change}
pixFindSkew
pixRotate { rotate by angle found by pixFindSkew}
pixRotate90 {do this 4 times to read image in all 4 orientations}
pixClipRectangle {crop image}
Finally tesseract command
I get garbage characters in the output.A sample Input Image is as follows.
The output that i get is as follows
Final K-1
II]
s h d | K-1 ,.,
(F°o.~?n‘i&1) 5/>.©12 mm E2‘;
Deparlrnenl of tho Treasury , ,
I 1 I l I
‘mama, Ravenuo SGMW For cnlundm your 201), ‘ " °F°$ "'100fTIO
or lax yum boqmnnnq 7 _ 20\Q_
‘ 7660
and ondmg _ W vv I go
Beneï¬ciary's Share of Income, Deductions,
cl'editS, etc. F 800 buck 01 loam nnd lnstruoflons»
___lnformatI0n About mo Estate or Trust
‘ Ordmary d|v|dm
i 12113
_
‘; Quahfmd dlVIdG
\ 8132
3 1
Net shun-term
A Estate's at trust's omgiuym ldonnlmnluon numbol
56-0987654
B Estate's u trust‘: namo
ESTATE OF MARTHA SMITH
0 Fiduc§ary's name, address, clly, smlu‘ and /IP codo
N01 long~lerm c
\ 24043
u
‘ 28% vale gann
Ti
Unreptumd 5
Omar porfloho 4
nonbuslness lfll
/\..4........ L. ._.._ ,.
What Should i do to improve the accuracy.
Part 2:
I tried to follow this link.And created a eng.user-words.traineddata file and bazaar.train file and tried to run with "bazaar" as additional parameter.but i get "read_params_file: can't open bazaar error". Any suggestions?
Upvotes: 0
Views: 1665
Reputation: 11
For part one,
I don't know if the image you posted up here is the actual one you've been trying to scan but when I tried it, I got this:-
Department oi the Treasury Internal Revenue Service
For cnlundm your V019, 1 ‘ '"l0T°5' |nC0m0
or tax yam boqlnnlnq , 2o12_ ‘ 7660 and ondlng I go 2: ‘ Ordinary dlvndm " “T ' x 12113
1; Quali?ed dwnda ‘ 8132 Netshun-term:
M Not long ~terrn c
i 24043 Ab ‘ 2896 ralagann
Bene?ciary’s Share of Income, Deductions, Cfedits, etc. 5 800 back oi form nnd Instruc?ons
| Partl Information About the state or Trust
A Estate's or IvLsl's omuoym Idonnlncnluon numhu
56-0987654
8 Estate‘: a trust‘: namo
ESTATE OF MARTHA SMITH
M: Unreptumd 5
017161 portioho : nonbuslness Inl
C Fiduc§ary's name, address, city, smlul an-(V1/If’ Eooo
It's not great but it seems a bit better than what you got. I'm using Tesseract v3 on Windows. My basic command was:
- tesseract.exe nnm.tif nnm
For part two,
your bazaar
file should be in the configs
folder
.....\Tesseract-OCR\tessdata\configs\bazaar
and there's some requirements for it to be saved in a particular format, like UTF8
with only a LF
at the end of the line not a CR + LF
, it seems to be quite fussy about the file formats.
you can get a copy of it from http://code.metager.de/source/raw/google/tesseract-ocr/tessdata/configs/bazaar
I made a digits config file that I used for scanning some images where I was only interested in the numbers and that worked fine:
- tesseract.exe scanfile.jpg scanfile digits
The documentation for Tesseract
is pretty poor and it doesn't work well on a PC.
Upvotes: 1
Reputation: 535
For part one,
I think you should consider the preprocessing done by Capture2Text. It is using both Leptonica and Tesseract to OCR the images.
I am not sure about part 2.
Upvotes: 0