Reputation: 7373
I've been trying plain Tesseract 3 OCR with different options to read a table of letters on which my students marked one letter per row as their answer to each multiple-choice question, as in the attached image.
One of the best outputs was:
EEEEEEEEEEEEEEEEEEEEEEEEE
DDDDDDDDDDDDDDDDDDDDDDDDD
CCCCCCCCCCCCCCCCCCCCCCCCC
BBBBBBBEBBBBBBBBBBBBBBBBB
AAAAAAAAAAAAAAAAAAAAAAAAA
6789012345678901234567890
2222333333333344444444445
EEEEE EEEE EE EEE EEEEEEE
DDDDDD DDD DDDDDDDDDDDD
CCCCCCCCCCCCCCCCCC CCCCC
B BEBE BB BBBBBBBBBBBBBBB
AA AAA AAAAA AAAAAAAA
1234567890123455789012345
OOOOOOOOO1111111111222222
I know I can parse that .txt to get a better result, but Tesseract missed a lot of information and also read letters from some of the painted blocks.
I would like to know what I can do to get better results in this case.
I would also like to have a table with the painted blocks appearing as a different character, for example, for the first and second lines of the image:
01 A B C - E 26 A B C D E
02 A - C D E 27 A B C D E
If you guys have some similar experience, any information will be appreciated! Thanks in advance!
Upvotes: 4
Views: 3410
Reputation: 9402
First, I suggest you preprocess your image, for example by making the dark parts darker and blurring it a little. Keep experimenting until Tesseract stops seeing letters in the filled-in squares.
Second, you have two options:
One, you can enable hOCR output and parse the layout of the scanned letters yourself. hOCR is an HTML-based format that records the bounding-box coordinates of every recognized word, so you can work out from those where the rows and columns are.
Two, you can try to make Tesseract recognise the layout properly, instead of reading the table as text rotated by 90°.
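To give an idea of the hOCR route, here is a minimal sketch using only the Python standard library. It assumes the usual hOCR conventions (words are `span` elements with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`). The names `WordCollector`, `parse_hocr` and `group_rows`, the `SAMPLE` snippet and the 10-pixel row tolerance are all made up for illustration, so adjust them to your scan:

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every span of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), *self._bbox))
            self._bbox = None


def parse_hocr(html):
    collector = WordCollector()
    collector.feed(html)
    collector.close()
    return collector.words


def group_rows(words, tol=10):
    """Group words whose vertical centres lie within tol pixels of each
    other, then order each row from left to right."""
    rows = []
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        cy = (word[2] + word[4]) / 2
        if rows and abs(cy - rows[-1]["cy"]) <= tol:
            rows[-1]["words"].append(word)
        else:
            rows.append({"cy": cy, "words": [word]})
    return [[w[0] for w in sorted(r["words"], key=lambda w: w[1])] for r in rows]


# Hypothetical hOCR fragment, only to show the shape of the input.
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 300 30'>
<span class='ocrx_word' title='bbox 10 10 40 30'>01</span>
<span class='ocrx_word' title='bbox 50 11 70 31'>A</span>
<span class='ocrx_word' title='bbox 200 12 230 30'>26</span>
</span>
<span class='ocr_line' title='bbox 10 40 300 60'>
<span class='ocrx_word' title='bbox 50 41 70 61'>A</span>
<span class='ocrx_word' title='bbox 10 40 40 60'>02</span>
</span>
"""

print(group_rows(parse_hocr(SAMPLE)))  # [['01', 'A', '26'], ['02', 'A']]
```

Once you have rows, a horizontal gap in a row that is much wider than the usual letter spacing is a good hint that a painted square sits there.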
Anyway, this is what I did:
1. I ran the image through ImageMagick:
$ convert CDZjN.png -deskew 40% -contrast-stretch 7%x10% -filter lanczos -resize 250% ooo.png
2. I created a config file, t.conf, for Tesseract that disables vertical text detection and the English dictionaries:
textord_tabfind_vertical_text 0
load_system_dawg 0
load_freq_dawg 0
load_punc_dawg 0
load_number_dawg 0
load_unambig_dawg 0
load_bigram_dawg 0
load_fixed_length_dawgs 0
3. I simply ran it:
$ tesseract ooo.png ooo t.conf ; cat ooo.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
01ABC-E 26ABCDE
02A CDE 27ABCDE
o3 BCDE 28ABCDE
o4 BCDE 29ABCDE
o5 BCDE 30ABCDE
06ABCD. 31ABCDE
07A-CDE 32ABCDE
08ABC.E 33ABCDE
o9 BCDE 34ABCDE
10A CDE 35ABCDE
11ABCD 36ABCDE
12ABC E 37ABCDE
13ABC E 38ABCDE
14ABCD 39ABCDE
15 BCDE 40ABCDE
1s BCDE 41ABCDE
17 BCDE 42ABCDE
18ABCD_ 43ABCDE
19AB DE 44ABCDE
20AB DE 45ABCDE
21ABCDE 46ABCDE
22ABCDE 47ABCDE
23ABCDE 48ABCDE
24ABCDE 49ABCDE
25ABCDE 50ABCDE
Not perfect, but passable.
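If you want the exact table from the question, the text above can be cleaned up with a few lines of Python. This is only a sketch under some assumptions of mine: the lines arrive in sheet order (row 1 first, with the banner line and blank lines already removed), the right-hand column starts 25 questions after the left one, and any of A to E missing from a block is the painted answer. Because the question numbers are taken from the row position, misreads like `o3` or `1s` do not matter. The names `marks` and `normalise` are made up for this example:

```python
OPTIONS = "ABCDE"


def marks(block):
    """Return the five option slots, with '-' for every letter that
    Tesseract did not find in this block."""
    seen = set(block.upper())
    return " ".join(c if c in seen else "-" for c in OPTIONS)


def normalise(lines, offset=25):
    """Rebuild 'NN A B C D E  MM A B C D E' rows from raw OCR lines.

    Question numbers come from the row position, not from the OCR text.
    """
    out = []
    for i, line in enumerate(lines, start=1):
        right_no = f"{i + offset:02d}"
        # Skip the two characters of the left number, then look for the
        # right-hand number to find where the second block starts.
        cut = line.find(right_no, 2)
        if cut == -1:
            # Fallback: assume the second block starts after the last space.
            cut = line.rstrip().rfind(" ") + 1
        left, right = line[2:cut], line[cut + 2:]
        out.append(f"{i:02d} {marks(left)} {right_no} {marks(right)}")
    return out


raw = ["01ABC-E 26ABCDE", "02A CDE 27ABCDE", "o3 BCDE 28ABCDE"]
for row in normalise(raw):
    print(row)
# 01 A B C - E 26 A B C D E
# 02 A - C D E 27 A B C D E
# 03 - B C D E 28 A B C D E
```

Note that this cannot catch a letter misread as another letter, and a row with no dash at all (such as 21 to 25 above) may mean either an unanswered question or a mark that Tesseract read through, so those rows are worth checking by hand.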
Upvotes: 5