Tesseract: line detection too sensitive

Question

I am trying to extract the text of .pdf files. They are first converted to images, then given to Tesseract. The recognition is good, but the output contains too many line breaks. For example, if the page is tilted slightly to the right, the sentence:
"I like Tesseract for reading text"
becomes:
"text read for Tesseract like I"
And that is already after some post-processing, because the raw text is:
"text
read
for
Tesseract
like
I"
The bug has occurred since the source .pdf files have been at 300 DPI. I understand that the problem comes from the resolution, but I cannot find how to solve it. Here is my Tesseract command:
Tesseract.exe dummy.pdf dumy-ocr.pdf --psm 12 --dpi 300 -l bvr+fra+eng+deu hocr pdf
First, I would like to solve the problem of too many lines. Then I will work out how to make the image perfectly straight.
Thank you in advance for your help.

https://i.sstatic.net/crmdO.jpg

K J · Accepted Answer

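You seem to be working backwards. The "many" lines, and thus the word reversal, are due to the anti-clockwise rotation. Each word sits a little higher than the one before it, so each is treated as its own line, and reading those lines from top to bottom reverses the sentence: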

                              text"
                      reading 
                  for 
        Tesseract 
   like 
"I

Fix that first and then the words will naturally all be placed on the same lines.
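To make the effect concrete, here is a minimal Python sketch of the geometry. It is not how Tesseract works internally; the word positions, the 10 degree skew, the 5 pixel line tolerance, and the helper names `rotate` and `group_into_lines` are all assumptions chosen for illustration. It only shows that a naive "group by height, read top to bottom" pass splits a skewed sentence into one line per word in reverse order, and that rotating the coordinates back restores a single line.

```python
import math

def rotate(point, angle_deg):
    """Rotate (x, y) about the origin with the standard rotation matrix.
    In image coordinates (y grows downward) a negative angle looks
    anti-clockwise on screen."""
    a = math.radians(angle_deg)
    x, y = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def group_into_lines(words, tolerance):
    """Group (text, x, y) words into lines, top to bottom.
    A word joins the current line if its y is within `tolerance`
    of that line's first word; each line is then read left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w[2]):
        if lines and abs(word[2] - lines[-1][0][2]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [[w[0] for w in sorted(line, key=lambda w: w[1])] for line in lines]

SKEW_DEG = 10      # hypothetical skew angle
TOLERANCE = 5      # hypothetical line-height tolerance in pixels
texts = ["I", "like", "Tesseract", "for", "reading", "text"]
xs = [0, 100, 200, 300, 400, 500]  # hypothetical word positions on one line

# Page rotated anti-clockwise: the right-hand end of the line moves up.
skewed = [(t, *rotate((x, 0.0), -SKEW_DEG)) for t, x in zip(texts, xs)]
skewed_lines = group_into_lines(skewed, TOLERANCE)

# Deskew: rotate every word back by the same angle before grouping.
deskewed = [(t, *rotate((x, y), SKEW_DEG)) for t, x, y in skewed]
deskewed_lines = group_into_lines(deskewed, TOLERANCE)

print(" / ".join(" ".join(line) for line in skewed_lines))
# text / reading / for / Tesseract / like / I
print(" / ".join(" ".join(line) for line in deskewed_lines))
# I like Tesseract for reading text
```

In practice the deskew is applied to the image itself before OCR rather than to word coordinates afterwards; the sketch only illustrates why the rotation has to be fixed first.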

If you are using Leptonica in conjunction with Tesseract, it is supposed to help with the pre-processing, including deskew.

However, there is a very small but powerful open-source GUI and command-line tool for Windows, Linux, and macOS that you could call from a shell; see https://galfar.vevb.net/wp/projects/deskew/. It is also available on GitHub as an AppVeyor CI artifact, so for the most up-to-date version (currently built 5 days ago) follow the green tick at https://github.com/galfar/deskew
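As a rough sketch of how the two steps could be chained from a script, the Python below builds a deskew command followed by a Tesseract command that reuses the options from the question. Several things here are assumptions rather than confirmed details: that both executables are on the PATH under the names `deskew` and `tesseract`, that `deskew` accepts `-o` for the output file (check its built-in help), and the file names `page.png`, `page-deskewed.png`, and `page-ocr`, which are placeholders.

```python
import subprocess

def build_commands(image, deskewed, out_base, langs):
    """Return the two command lines: deskew first, then OCR on the result."""
    # Assumption: `deskew -o <output> <input>`; verify against the tool's help.
    deskew_cmd = ["deskew", "-o", deskewed, image]
    # Options copied from the question's Tesseract command.
    tesseract_cmd = ["tesseract", deskewed, out_base,
                     "--psm", "12", "--dpi", "300",
                     "-l", langs, "hocr", "pdf"]
    return deskew_cmd, tesseract_cmd

def run_pipeline(image, deskewed, out_base, langs):
    """Run both steps in order and stop on the first failure."""
    for cmd in build_commands(image, deskewed, out_base, langs):
        subprocess.run(cmd, check=True)

# Hypothetical file names; nothing is executed until run_pipeline is called.
commands = build_commands("page.png", "page-deskewed.png", "page-ocr",
                          "bvr+fra+eng+deu")
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_pipeline("page.png", "page-deskewed.png", "page-ocr", "bvr+fra+eng+deu")` would then run the deskew step and feed its output to Tesseract.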
