Null Pointers etc.


How do I make slashes act as word separators in HOCR output (Tesseract OCR)?

Is there any way to tell Tesseract OCR to treat certain characters as word separators in the HOCR output?

For example, say I have a document about the Scranton/Wilkes-Barre RailRiders, and I want the slash to be treated as a word separator. So instead of this output:

<span class='ocrx_word' id='word_1_2' title='bbox 186 324 1201 395; x_wconf 85' lang='eng' dir='ltr'>Scranton/Wilkes-Barre</span>

I need output that looks like this (bboxes are estimated):

<span class='ocrx_word' id='word_1_2' title='bbox 186 324 799 395; x_wconf 85' lang='eng' dir='ltr'>Scranton</span>
<span class='ocrx_word' id='word_1_3' title='bbox 800 324 820 395; x_wconf 85' lang='eng' dir='ltr'>/</span>
<span class='ocrx_word' id='word_1_4' title='bbox 821 324 1201 395; x_wconf 85' lang='eng' dir='ltr'>Wilkes-Barre</span>
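
To make the target concrete, the split could in principle be approximated by post-processing the HOCR rather than inside Tesseract. The following is only a minimal sketch under stated assumptions: the function name `split_word_on` and the regular expression are hypothetical helpers (not part of Tesseract), the span is assumed to have exactly the attribute layout shown above, and the new bounding boxes are guessed from character counts, so they will not match the real glyph positions.

```python
import re

# Matches one ocrx_word span in the exact layout shown above (an assumption;
# real HOCR may order or quote attributes differently).
SPAN_RE = re.compile(
    r"<span class='ocrx_word' id='(?P<id>[^']+)' "
    r"title='bbox (?P<x0>\d+) (?P<y0>\d+) (?P<x1>\d+) (?P<y1>\d+); "
    r"x_wconf (?P<conf>\d+)'(?P<rest>[^>]*)>(?P<text>[^<]*)</span>"
)


def split_word_on(span, sep="/"):
    """Split one ocrx_word span at `sep`, keeping `sep` as its own word.

    Horizontal bbox edges are estimated in proportion to character counts.
    Returns a list of span strings; the input is returned unchanged (in a
    one-element list) if it does not match or does not contain `sep`.
    """
    m = SPAN_RE.fullmatch(span.strip())
    if m is None or sep not in m["text"]:
        return [span]

    # The capturing group makes re.split keep the separator as a token.
    tokens = [t for t in re.split(f"({re.escape(sep)})", m["text"]) if t]
    x0, x1 = int(m["x0"]), int(m["x1"])
    total = len(m["text"])
    prefix, num = m["id"].rsplit("_", 1)

    out = []
    pos = 0
    for i, tok in enumerate(tokens):
        left = x0 + round((x1 - x0) * pos / total)
        pos += len(tok)
        right = x0 + round((x1 - x0) * pos / total)
        out.append(
            f"<span class='ocrx_word' id='{prefix}_{int(num) + i}' "
            f"title='bbox {left} {m['y0']} {right} {m['y1']}; "
            f"x_wconf {m['conf']}'{m['rest']}>{tok}</span>"
        )
    return out


original = (
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 186 324 1201 395; x_wconf 85' "
    "lang='eng' dir='ltr'>Scranton/Wilkes-Barre</span>"
)
for line in split_word_on(original):
    print(line)
```

This has obvious limits: the ids of all later words on the page would need renumbering, adjacent boxes share an edge, and every piece inherits the confidence of the original word. That is why I would prefer a Tesseract setting that does the split during recognition.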

I have tried two possible solutions:

  1. Setting "tessedit_char_blacklist" to "/". This did not work: Tesseract simply recognized the slash as a lowercase L instead.

  2. Setting "chs_trailing_punct1" to ").,;:?!/" (the default characters plus the slash). This did not change the output at all.
