pytesseract: good OCR or good Lines - never both

Question

I'm using pytesseract (Tesseract version 3.05) to OCR (Optical Character Recognition) a printed PDF bill that was created digitally. I pre-process it by removing all color, converting it to pure black and white, and rendering it at 600 DPI. It is proprietary information, so I can't post it here, but trust me when I say it is perfectly straight and very clear.

When processing the images, I've been playing with various Page Segmentation Modes (PSM).

A few PSMs (e.g. 11 and 12) recognize the characters brilliantly, nearly perfectly, but a single line becomes multiple lines that often get shuffled, making data parsing functionally impossible.

Other PSMs (e.g. 3 and 4) keep the lines perfectly intact (which is helpful for data parsing), but the OCR is terrible: spaces are inserted, dashes become apostrophes, an 'l' becomes a '1' or even an 'i', etc.

I've tried all PSMs and can't find one that lets me keep both the line structure and the OCR quality.

Are there additional dials I can turn to get both, and maybe further improve the quality of the resulting text?

Code:

import pytesseract

psm_version = 3  # page segmentation mode under test
text = pytesseract.image_to_string(b_w_file, lang='eng', config='-psm {}'.format(psm_version))
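
For comparing the modes side by side, a minimal sketch of running the same image through several PSMs might look like the following. The helper names `psm_config` and `ocr_with_each_psm` are hypothetical and not part of pytesseract. The `legacy` switch assumes the single-dash `-psm` form used above for Tesseract 3.x, while Tesseract 4.x and later expect `--psm`.

```python
def psm_config(psm, legacy=True):
    """Build the config string for one page segmentation mode.

    Hypothetical helper: legacy=True gives the '-psm N' form used in the
    question (Tesseract 3.x); legacy=False gives '--psm N' (Tesseract 4.x+).
    """
    flag = '-psm' if legacy else '--psm'
    return '{} {}'.format(flag, psm)


def ocr_with_each_psm(image, psms=(3, 4, 11, 12), lang='eng', legacy=True):
    """Return a dict mapping each PSM to the text recognized for `image`."""
    # Imported lazily so psm_config can be used without pytesseract installed.
    import pytesseract

    results = {}
    for psm in psms:
        results[psm] = pytesseract.image_to_string(
            image, lang=lang, config=psm_config(psm, legacy))
    return results
```

Calling `ocr_with_each_psm(b_w_file)` and printing the number of lines in each result is one quick way to see which modes split or merge lines.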

Answers (1)