Reputation: 8996
I'm using pytesseract (Tesseract version 3.05) to OCR (Optical Character Recognition) a printed bill that was created digitally as a PDF. I pre-process it to remove all color, converting it to pure black and white at 600 DPI. It is proprietary information, so I can't post it here, but trust me when I say it is perfectly straight and very clear.
When processing the images, I've been playing with various Page Segmentation Modes (PSM).
A few PSMs (e.g. 11 and 12) recognize the characters brilliantly, nearly perfectly, but a single line becomes multiple lines and the lines often get shuffled, making data parsing functionally impossible.
Other PSMs (e.g. 3 and 4) keep the lines intact (which is helpful for data parsing), but the OCR is terrible: spaces are inserted, dashes become apostrophes, an 'l' becomes a '1' or even an 'i', etc.
I've tried all the PSMs and can't find one that lets me keep both the line structure and the OCR quality.
Are there additional dials I can turn to allow me to do both, and maybe further increase the quality of the resultant text?
Code:
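import pytesseract
# b_w_file is the pre-processed black-and-white image; Tesseract 3.x uses the single-dash '-psm' flag
psm_version = 3
text = pytesseract.image_to_string(b_w_file, lang='eng', config='-psm {}'.format(psm_version))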
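One direction that might help with the shuffled lines from PSM 11/12 is to ask for word-level bounding boxes instead of plain text and rebuild the lines from their vertical positions. Below is a minimal sketch, not a tested solution for this bill. It assumes the word data comes from `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, which provides the `text`, `left`, `top` and `height` lists. The function name `group_words_into_lines` and the `tolerance` parameter are hypothetical, introduced only for illustration.

```python
from statistics import median


def group_words_into_lines(data, tolerance=0.6):
    """Rebuild text lines from word boxes (hypothetical helper).

    `data` is a dict of parallel lists with the keys 'text', 'left',
    'top' and 'height', in the shape returned by pytesseract.image_to_data
    with output_type=Output.DICT. `tolerance` is an assumed tuning knob:
    the fraction of the word height by which vertical centers may differ
    while still counting as the same line.
    """
    words = []
    for text, left, top, height in zip(
        data["text"], data["left"], data["top"], data["height"]
    ):
        if not text.strip():
            continue  # skip empty entries (block/paragraph/line rows)
        words.append(
            {"text": text, "left": left, "center": top + height / 2, "height": height}
        )
    if not words:
        return []

    # Walk the words from top to bottom, starting a new line whenever the
    # vertical center jumps by more than the allowed tolerance.
    words.sort(key=lambda w: w["center"])
    lines = []
    current = [words[0]]
    for word in words[1:]:
        reference = median(w["center"] for w in current)
        limit = tolerance * median(w["height"] for w in current)
        if abs(word["center"] - reference) <= limit:
            current.append(word)
        else:
            lines.append(current)
            current = [word]
    lines.append(current)

    # Within each line, order the words from left to right.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["left"]))
        for line in lines
    ]
```

With pytesseract installed, the input could be obtained with something like `data = pytesseract.image_to_data(b_w_file, lang='eng', config='-psm 11', output_type=pytesseract.Output.DICT)`. Whether the regrouped lines are good enough for parsing depends on the layout, and `tolerance` would likely need tuning.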
Upvotes: 5
Views: 1100
Reputation: 58
I'm not familiar with pytesseract, but I have worked with the C# port pretty extensively. I feed it .tiff files, and the irony is that the higher the DPI of the .tiff, the worse Tesseract seems to perform. I found the sweet spot at around 119 DPI. The solution that worked for me was to create two .tiffs: one high-DPI file for my output and one low-DPI file that I feed to Tesseract. I have the Tesseract iterator pass me the coordinates of the bounding boxes it finds, and then I use those coordinates on the higher-DPI .tiff to do what I am trying to do. It's not the most efficient process, so I have since moved on to other options and no longer have the code. Hope this helps!
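Since the original code is gone, here is only a rough Python sketch of the coordinate step in the two-file approach. It assumes that both images show the same page and differ only in resolution, so a box found on the low-DPI image maps to the high-DPI image by multiplying by the DPI ratio. The function name `scale_box` and the `(left, top, width, height)` box layout are assumptions for illustration, not part of any Tesseract API.

```python
def scale_box(box, from_dpi, to_dpi):
    """Map a (left, top, width, height) box between two resolutions.

    Hypothetical helper: assumes both images cover the same page area,
    so pixel coordinates scale linearly with the DPI ratio.
    """
    factor = to_dpi / from_dpi
    left, top, width, height = box
    # Round to whole pixels for use as image coordinates.
    return tuple(round(value * factor) for value in (left, top, width, height))


# Example: a box found on a 150 DPI image, mapped onto a 600 DPI image.
low_dpi_box = (10, 20, 30, 40)
high_dpi_box = scale_box(low_dpi_box, from_dpi=150, to_dpi=600)
```

The scaled box can then be used to crop or annotate the high-DPI file. With a ratio that is not a whole number (for example 119 to 600 DPI), rounding may shift a box by a pixel, so a small margin around each box is probably sensible.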
Upvotes: 1