I write in Python, using pytesseract or direct Popen calls if needed.
I am trying to OCR a document with an irregular structure, a letter that looks like this:
The problem is that in the .hocr file generated by Tesseract I get lines in which the left and right columns are glued together, like "Recipient: Sender:".
What I'd like to achieve is separate output for the left and right columns. Using third-party Python utilities to pre-process the image is an acceptable solution if explained in reasonable detail. The script must run autonomously and detect this issue by itself, as not all the letters have such strange formatting.
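To make the pre-processing idea concrete, here is a minimal sketch of one possible detection step: a vertical projection profile that looks for a run of empty pixel columns (a gutter) near the middle of a horizontal band of the page. Everything in it is an assumption for illustration: the function name `find_gutter`, the thresholds, and the input format (a list of rows with 1 for ink and 0 for background, which a library such as Pillow or OpenCV could produce after binarization). It would have to be run per band, for example on the letter header only, because the body text spans the full width.

```python
def find_gutter(binary_rows, min_width_frac=0.03, margin_frac=0.2):
    """Return (start, end) of the widest run of ink-free pixel columns in the
    central part of the band, or None if no such run is wide enough.

    binary_rows: list of equal-length rows, 1 = ink, 0 = background.
    min_width_frac and margin_frac are guessed defaults, not tuned values.
    """
    if not binary_rows or not binary_rows[0]:
        return None
    width = len(binary_rows[0])
    # Amount of ink in every pixel column of the band.
    profile = [sum(row[x] for row in binary_rows) for x in range(width)]

    # Ignore the outer margins so the page border is not taken for a gutter.
    lo = int(width * margin_frac)
    hi = width - lo

    best = None
    run_start = None
    for x in range(lo, hi + 1):
        empty = x < hi and profile[x] == 0
        if empty and run_start is None:
            run_start = x                      # an empty run begins
        elif not empty and run_start is not None:
            if best is None or x - run_start > best[1] - best[0]:
                best = (run_start, x)          # widest run so far
            run_start = None

    min_width = max(1, int(width * min_width_frac))
    if best is not None and best[1] - best[0] >= min_width:
        return best
    return None
```

If a gutter is reported for a band, that band could be cropped into a left and a right image and each passed to Tesseract separately; if `None` is returned, the band is processed as usual.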
What I have tried so far:
Using --psm 1 to allow input format detection: no improvement over the default, likely because the structure is too complicated.
Tweaking some config file options like gapmap_use_ends and textord_words_maxspace: I couldn't find good documentation on these. There is probably a right combination of values, but there are 57 options with "space" in the name. Any insight on these would be much appreciated.
Editing the .hocr: I am not sure how to write grouping rules for the word boxes that do not interfere with normal text everywhere else.
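For reference, the kind of grouping rule I have in mind could be sketched as follows: read the word boxes from the hOCR and split a line wherever the horizontal gap between two neighbouring words is much larger than the text height. This is only a sketch under assumptions: the names `HocrWords`, `split_line` and `split_columns` are made up, the factor of 3 is a guess that would need tuning, and only elements with class `ocr_line` are treated as lines (Tesseract can emit other line-level classes as well).

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect, per ocr_line, the words as (x0, y0, x1, y1, text) tuples."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._bbox = None  # bbox of the word whose text is expected next

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes:
            m = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, m.groups())) if m else None

    def handle_data(self, data):
        text = data.strip()
        if self._bbox and text and self.lines:
            self.lines[-1].append(self._bbox + (text,))
            self._bbox = None


def split_line(words, gap_factor=3.0):
    """Split one line into text segments at unusually wide horizontal gaps.

    A gap counts as a column break when it exceeds gap_factor times the
    median word height of the line (a guessed heuristic).
    """
    if not words:
        return []
    words = sorted(words, key=lambda w: w[0])
    heights = sorted(w[3] - w[1] for w in words)
    median_h = heights[len(heights) // 2]
    segments = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if cur[0] - prev[2] > gap_factor * median_h:
            segments.append([])  # wide gap: start a new segment
        segments[-1].append(cur)
    return [" ".join(w[4] for w in seg) for seg in segments]


def split_columns(hocr_text):
    """Return, for every non-empty ocr_line, its list of text segments."""
    parser = HocrWords()
    parser.feed(hocr_text)
    return [split_line(line) for line in parser.lines if line]
```

A letter could then be flagged as having the two-column layout when some line yields more than one segment, which would leave ordinary paragraphs untouched as long as their word gaps stay below the threshold. Whether one threshold works for both the columns and widely spaced normal text is exactly the part I am unsure about.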