How do I make slashes act as word separators in HOCR output (Tesseract OCR)?

Question

Is there any way to tell Tesseract OCR to treat certain characters as word separators in the HOCR output?

For example, say I have a document about the Scranton/Wilkes-Barre RailRiders, and I want the slash to be treated as a word separator. So instead of this output:

Scranton/Wilkes-Barre

I need output that looks like this (bboxes are estimated):

Scranton
/
Wilkes-Barre

I have tried two possible solutions:

1. Setting "tessedit_char_blacklist" to "/". This did not work, because Tesseract simply replaced the slash with a lowercase L.
2. Setting "chs_trailing_punct1" to ").,;:?!/" (the default characters plus the slash). This did not change the output at all.
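
For reference, a fallback that does not depend on any Tesseract setting would be to post-process the hOCR words and split them at the slash. Below is a minimal Python sketch of that idea. It is not a Tesseract feature: `split_word` is a hypothetical helper, the bbox in the usage example is made up, and the piece boxes are only estimates because the code assumes every character in the word is equally wide.

```python
import re


def split_word(text, bbox, separators="/"):
    """Split one recognized word into (text, bbox) pieces at each separator.

    bbox is (x0, y0, x1, y1) in pixels. Piece widths are estimated by
    assuming every character is equally wide, so the boxes are approximate.
    """
    x0, y0, x1, y1 = bbox
    # The capturing group makes re.split keep each separator as its own piece.
    pattern = "([" + re.escape(separators) + "])"
    pieces = [p for p in re.split(pattern, text) if p]
    if len(pieces) <= 1:
        # Nothing to split (also covers empty text).
        return [(text, bbox)]

    char_width = (x1 - x0) / len(text)
    result = []
    offset = 0  # number of characters consumed so far
    for piece in pieces:
        left = round(x0 + offset * char_width)
        offset += len(piece)
        right = round(x0 + offset * char_width)
        result.append((piece, (left, y0, right, y1)))
    return result


if __name__ == "__main__":
    # Made-up bbox for illustration only.
    for piece in split_word("Scranton/Wilkes-Barre", (100, 50, 310, 80)):
        print(piece)
```

Applying this to real output would additionally require reading each `ocrx_word` element and the `bbox` values in its `title` attribute from the hOCR file, then writing one element per piece; that part is left out here.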

Answers (0)