How to improve OCR accuracy?

Question

I have two images, shown below. Tesseract reads A.png perfectly, but its accuracy on B.png is terrible, even though B.png looks similar to A.png. How can I improve the accuracy? I have no idea where to start debugging.

A.png

B.png

Run OCR

# tesseract -v
tesseract 4.1.1-rc2-22-g08899

# tesseract A.png stdout -l jpn --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
第 3 期 決算 公告 令 和 2 年 2 月 7 日
大 阪 市 中 央 区 南 新町 一 丁目 3 番 10 号
株 式 会 社 Link_Mobile

代表 取締 役 佐々 木 勉

貸借 対照 表 の 要旨 (平成 31 年 3 月 31 日 現在 }

# tesseract B.png stdout -l jpn --psm 6
Warning: Invalid resolution 0 dpi. Using 70 instead.
。 人 加計
区 三 6 番 12 号
中 野 駅 前 ビル 5 | 、
am 人 mw
に て
貸借 対照 表 の 要旨 ( 令 和 元 年 11 月 30 日 現在 }

Update 1

Were both scanned using the same scanner, and at the same resolution?

Yes. Both images were cropped out of the same PDF.

Are you taking advantage of any APIs which Tesseract exposes for pre-processing the images before doing OCR?

No, I did not know about that. I am looking into it now.

zono · Accepted Answer

The result improved after I read the Tesseract documentation and rescaled the image. The relevant passage:

Rescaling: Tesseract works best on images which have a DPI of at least 300 dpi, so it may be beneficial to resize images. For more information see the FAQ.
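The answer does not say which tool was used for the rescaling, so the following is only a minimal sketch of one way to do it, assuming Python with the third-party Pillow library. The function names, the Lanczos filter, and the `source_dpi` parameter are illustrative choices, not part of the original answer. Note that the 70 dpi in the warning above is Tesseract's fallback guess rather than a measured resolution, so the value passed as `source_dpi` is an estimate you have to supply.

```python
from __future__ import annotations

from pathlib import Path

TARGET_DPI = 300  # minimum DPI recommended in the quoted documentation


def scale_factor(source_dpi: float, target_dpi: float = TARGET_DPI) -> float:
    """Return the enlargement factor needed to reach target_dpi (never below 1.0)."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return max(1.0, target_dpi / source_dpi)


def scaled_size(width: int, height: int, factor: float) -> tuple[int, int]:
    """Return the pixel dimensions after scaling, rounded to whole pixels."""
    return round(width * factor), round(height * factor)


def rescale_image(src: Path, dst: Path, source_dpi: float) -> None:
    """Upscale src to roughly TARGET_DPI and save the result to dst."""
    # Assumption: Pillow is installed (pip install Pillow).
    from PIL import Image

    factor = scale_factor(source_dpi)
    with Image.open(src) as img:
        new_size = scaled_size(img.width, img.height, factor)
        resized = img.resize(new_size, Image.LANCZOS)
        # Write the DPI into the file so Tesseract does not fall back to its default.
        resized.save(dst, dpi=(TARGET_DPI, TARGET_DPI))
```

For example, `rescale_image(Path("B.png"), Path("B2.png"), source_dpi=100)` would enlarge the image threefold before it is passed to `tesseract B2.png stdout -l jpn --psm 6`. The right factor depends on the actual image, so it is worth trying a few values and comparing the OCR output.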

Rescaled image

Run OCR

# tesseract B2.png stdout -l jpn --psm 6
第 54 期 決 算 公 告 _ 令 和 2 年 1 月 29 日
東京 都 中 野 区 中 野 三 丁目 36 番 12 号
中 野 駅 前 ビル 5 F
株 式 会 社 コ ー エ ー テ クニ カ
代表 取締 役 小 空 _ 修
貸借 対照 表 の 要旨 ( 令 和 元 年 11 月 30 日 現在 )
