elPastor

Reputation: 8996

pytesseract: good OCR or good Lines - never both

I'm using pytesseract (Tesseract version 3.05) to OCR (Optical Character Recognition) a printed bill that was created digitally as a PDF. I pre-process it to remove any color, convert it to pure black and white, and set it to 600 DPI. It contains proprietary information so I can't post it here, but trust me when I say it is perfectly straight and very clear.

When processing the images, I've been playing with various Page Segmentation Modes (PSM).

A few PSMs (e.g. 11 & 12) recognize the characters brilliantly - nearly perfectly - but a single line will be split into multiple lines that often get shuffled, making data parsing functionally impossible.

Other PSMs (e.g. 3 & 4) keep the lines perfectly intact (which is helpful for data parsing), but the OCR is terrible (spaces are inserted, dashes become apostrophes, an 'l' becomes a '1' or even an 'i', etc.).

I've tried all the PSMs and can't find one that keeps both the line structure and the OCR quality.

Are there additional dials I can turn to allow me to do both, and maybe further increase the quality of the resultant text?

Code:

import pytesseract

psm_version = 3  # page segmentation mode; Tesseract 3.05 takes '-psm' (4.x and later take '--psm')
text = pytesseract.image_to_string(b_w_file, lang='eng', config='-psm {}'.format(psm_version))

Upvotes: 5

Views: 1100

Answers (1)

Will Jackson

Reputation: 58

I'm not familiar with pytesseract, but I have messed around with the C# port pretty extensively. I am feeding it .tiffs, and the irony is that the higher I make the DPI of the .tiff, the worse Tesseract seems to perform. I found the sweet spot at about 119 DPI.

The solution that works for me is to create two .tiffs: one at high DPI, which is for my output, and one at low DPI, which I feed to Tesseract. I have the Tesseract iterator pass me the coordinates of the bounding boxes it finds, and then I use those coordinates (scaled up) on the higher-DPI .tiff to do what I am trying to do. It's not the most efficient process, so I have since moved on to other options and do not have the code anymore. Hope this helps!
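A rough sketch of that two-resolution idea in pytesseract, for illustration only. This is not the original C# code. The function names, file paths and DPI values are placeholders, and it assumes both images were rendered from the same page, so the box coordinates differ only by the DPI ratio. It relies on `pytesseract.image_to_data`, which returns per-word bounding boxes.

```python
def scale_box(box, low_dpi, high_dpi):
    """Map a (left, top, width, height) box from the low-DPI image
    onto the high-DPI image of the same page."""
    factor = high_dpi / low_dpi
    return tuple(round(value * factor) for value in box)


def word_boxes(low_res_path):
    """Run Tesseract on the low-DPI image and return (text, box) pairs,
    one per recognized word. Boxes are in low-DPI pixel coordinates."""
    # Third-party imports are kept local so scale_box can be used
    # (and tested) without Tesseract or Pillow installed.
    import pytesseract
    from PIL import Image

    with Image.open(low_res_path) as img:
        data = pytesseract.image_to_data(
            img, output_type=pytesseract.Output.DICT
        )

    results = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip the empty entries for blocks/lines
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            results.append((text, box))
    return results


def crop_from_high_res(high_res_path, box, low_dpi, high_dpi):
    """Cut the region matching a low-DPI box out of the high-DPI image."""
    from PIL import Image

    left, top, width, height = scale_box(box, low_dpi, high_dpi)
    with Image.open(high_res_path) as img:
        return img.crop((left, top, left + width, top + height))
```

Usage would be along the lines of calling `word_boxes("page_low.tif")` and then `crop_from_high_res("page_high.tif", box, 119, 600)` for each box of interest. The file names are hypothetical, and the best low DPI likely depends on the document.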

Upvotes: 1
