Dedalus

Reputation: 360

Limit space size in Tesseract

I write in Python, using pytesseract or direct Popen calls if needed.

I am trying to OCR a document with an irregular structure, a letter with a left and a right column (image omitted). The problem is that in the .hocr file generated by Tesseract, I get lines consisting of the left and right columns glued together, like "Recipient: Sender:".

What I'd like is separate output for the left and right columns. Using third-party Python utilities to pre-process the image is an acceptable solution if explained in reasonable detail. The script must run unattended and detect this layout on its own, since not all the letters have such strange formatting.

Tried/ideas:

Using --psm 1 to allow input format detection - no improvement over default, likely because structure is too complicated.

Tweaking some config file options like gapmap_use_ends and textord_words_maxspace - I couldn't find good documentation on these. There is probably a right combination of values, but there are 57 options with "space" in the name... Any insight on these would be much appreciated.

Editing the .hocr - not sure how to write appropriate grouping rules for the word boxes that do not interfere with normal text everywhere else...
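To make the .hocr idea concrete, here is a minimal sketch of one possible grouping rule: split a line wherever the horizontal gap between neighbouring word boxes is much larger than the text height. The function name `split_line_on_wide_gaps`, the `(text, left, top, right, bottom)` tuple layout, and the `gap_factor` default of 3.0 are all assumptions made for illustration and would need tuning on real letters. The word boxes are assumed to have already been parsed from the hOCR `bbox` values.

```python
from typing import List, Tuple

# Hypothetical word record: (text, left, top, right, bottom) in pixels,
# e.g. taken from the "bbox" values of the word elements in the .hocr file.
Word = Tuple[str, int, int, int, int]


def split_line_on_wide_gaps(words: List[Word], gap_factor: float = 3.0) -> List[List[Word]]:
    """Split one OCR line into segments at unusually wide horizontal gaps.

    A gap counts as "wide" when it exceeds gap_factor times the median
    word height of the line (gap_factor is an assumed, tunable value).
    """
    if not words:
        return []

    # Process words left to right.
    ordered = sorted(words, key=lambda w: w[1])

    # Use the median word height as a rough proxy for the font size.
    heights = sorted(w[4] - w[2] for w in ordered)
    median_height = heights[len(heights) // 2]
    threshold = gap_factor * median_height

    segments = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[1] - prev[3]  # left edge of current minus right edge of previous
        if gap > threshold:
            segments.append([cur])    # wide gap: start a new segment
        else:
            segments[-1].append(cur)  # normal word spacing: same segment
    return segments


glued = [("Recipient:", 100, 50, 220, 80), ("Sender:", 900, 50, 990, 80)]
normal = [("Dear", 100, 200, 160, 230), ("Sir,", 175, 200, 230, 230)]

glued_segments = split_line_on_wide_gaps(glued)
normal_segments = split_line_on_wide_gaps(normal)
```

With these made-up boxes, the glued line is split into two segments while the normally spaced line stays in one piece, so ordinary text should be left alone as long as the threshold sits well above the usual word spacing.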

Upvotes: 2

Views: 884

Answers (0)
