Is there any way to tell Tesseract OCR to treat certain characters as word separators in its hOCR output?
For example, say I have a document about the Scranton/Wilkes-Barre RailRiders, and I want the slash to be treated as a word separator. So instead of this output:
<span class='ocrx_word' id='word_1_2' title='bbox 186 324 1201 395; x_wconf 85' lang='eng' dir='ltr'>Scranton/Wilkes-Barre</span>
I need output that looks like this (bboxes are estimated):
<span class='ocrx_word' id='word_1_2' title='bbox 186 324 799 395; x_wconf 85' lang='eng' dir='ltr'>Scranton</span>
<span class='ocrx_word' id='word_1_3' title='bbox 800 324 820 395; x_wconf 85' lang='eng' dir='ltr'>/</span>
<span class='ocrx_word' id='word_1_4' title='bbox 821 324 1201 395; x_wconf 85' lang='eng' dir='ltr'>Wilkes-Barre</span>
I have tried two possible solutions:
Setting "tessedit_char_blacklist" to "/". This did not work: Tesseract simply replaced the slash with a lowercase L.
Setting "chs_trailing_punct1" to ").,;:?!/" (the default characters plus the slash). This did not change the output at all.
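Failing a Tesseract option, one fallback would be to split the words after OCR. Below is a minimal sketch of that idea, not a Tesseract feature: `split_word` is a hypothetical helper that takes the word text and its bbox (as parsed from the `title` attribute) and divides the horizontal extent in proportion to character count. That proportional split is an assumption and is only approximate for proportional fonts, so the resulting bboxes are estimates, as in the desired output above.

```python
import re


def split_word(text, bbox, separators="/"):
    """Split one OCR word at separator characters and estimate a bbox per piece.

    Hypothetical helper: the horizontal extent is divided in proportion to
    character count, so the x coordinates are estimates, not measurements.
    """
    x0, y0, x1, y1 = bbox
    # The capturing group makes re.split keep each separator as its own token.
    pattern = "([" + re.escape(separators) + "])"
    tokens = [t for t in re.split(pattern, text) if t]
    total = sum(len(t) for t in tokens)

    pieces = []
    consumed = 0
    for token in tokens:
        left = x0 + round((x1 - x0) * consumed / total)
        consumed += len(token)
        right = x0 + round((x1 - x0) * consumed / total)
        # The vertical extent is copied unchanged from the original word.
        pieces.append((token, (left, y0, right, y1)))
    return pieces


if __name__ == "__main__":
    for token, box in split_word("Scranton/Wilkes-Barre", (186, 324, 1201, 395)):
        print(token, box)
```

Each returned piece can then be written out as its own `ocrx_word` span, reusing the original `x_wconf` value and assigning new ids. Adjacent pieces share an edge here; shift the left edge by one pixel if non-overlapping boxes are required.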