Reputation: 71
I am trying to extract the text from .pdf files.
They are first converted to images, then given to Tesseract.
The detection is good, but the output contains too many line breaks.
For example, if the page is slightly tilted to the right, the sentence:
"I like Tesseract for reading text"
becomes:
"text read for Tesseract like I"
And that is already after some post-processing, because the raw text is:
"text
read
for
Tesseract
like
I"
The bug has occurred ever since the source .pdf files have been at 300 DPI. I understand that the problem comes from the resolution, but I cannot find how to solve it.
Here is my Tesseract command:
Tesseract.exe dummy.pdf dumy-ocr.pdf --psm 12 --dpi 300 -l bvr+fra+eng+deu hocr pdf
First, I would like to solve the problem of too many lines.
Then I would like to find out how to make the image perfectly straight.
Thank you in advance for your help.
https://i.sstatic.net/crmdO.jpg
Upvotes: 1
Views: 627
Reputation: 11737
You seem to be working backwards. The "many" lines, and thus the word reversal, are due to the anti-clockwise rotation of the page:
text"
reading
for
Tesseract
like
"I
Fix that first and then the words will naturally all be placed on the same lines.
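To make the cause concrete, here is a toy model of the effect, not Tesseract's actual layout analysis. It places word centres along a tilted baseline and groups them into lines purely by vertical position. The 8 degree tilt, 100 px word spacing and 20 px line height are assumptions chosen for illustration.

```python
import math

WORDS = ["I", "like", "Tesseract", "for", "reading", "text"]


def word_centres(words, angle_deg, spacing=100.0):
    """Place word centres along a baseline tilted anti-clockwise by angle_deg.

    Image coordinates are used: x grows to the right and y grows downwards,
    so an anti-clockwise tilt lifts the right end of the line (smaller y).
    """
    theta = math.radians(angle_deg)
    centres = []
    for i, word in enumerate(words):
        d = i * spacing  # distance of this word along the baseline
        centres.append((word, d * math.cos(theta), -d * math.sin(theta)))
    return centres


def group_into_lines(centres, line_height=20.0):
    """Naive layout: walk the words from top to bottom and start a new line
    whenever a word sits more than half a line height below the previous one.
    Each resulting line is then read from left to right."""
    ordered = sorted(centres, key=lambda c: c[2])  # top of the page first
    lines = []
    last_y = None
    for word, x, y in ordered:
        if last_y is None or y - last_y > line_height / 2:
            lines.append([])
        lines[-1].append((x, word))
        last_y = y
    return [" ".join(w for _, w in sorted(line)) for line in lines]


# Straight page: one line, in reading order.
print(group_into_lines(word_centres(WORDS, 0)))
# Tilted page: every word becomes its own line, rightmost word first.
print(group_into_lines(word_centres(WORDS, 8)))
```

With no tilt the six words stay on one line. With the assumed 8 degree tilt each word drops far enough below its right-hand neighbour to be treated as a separate line, and reading those lines from top to bottom gives the reversed order shown above.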
If you use Leptonica in conjunction with Tesseract, it is supposed to help with the pre-processing, including deskew.
However, there is a very small but powerful open-source GUI and command-line tool for Windows, Linux, and macOS that you could use from a shell; see https://galfar.vevb.net/wp/projects/deskew/. It is also available on GitHub as an AppVeyor CI artifact, so for the most up-to-date version (currently 5 days ago) follow the green tick at https://github.com/galfar/deskew
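As a rough sketch of how the two steps could be chained, the helper below builds a deskew command followed by a Tesseract command and runs them in order. The function names and file names are hypothetical. The `-o` output flag for `deskew` is written from memory of the tool's usage text and should be checked against your build, and the Tesseract options are simply copied from the question.

```python
import subprocess


def deskew_command(src, dst, exe="deskew"):
    """Command line for the deskew tool.

    Assumption: "-o" names the output file; run the tool without
    arguments to confirm the flags of your version.
    """
    return [exe, "-o", dst, src]


def tesseract_command(image, out_base, langs="bvr+fra+eng+deu", psm=12, dpi=300):
    """Tesseract command line using the same options as in the question."""
    return [
        "tesseract", image, out_base,
        "--psm", str(psm),
        "--dpi", str(dpi),
        "-l", langs,
        "hocr", "pdf",
    ]


def deskew_then_ocr(src, deskewed, out_base):
    """Straighten the page image first, then run OCR on the result."""
    for cmd in (deskew_command(src, deskewed),
                tesseract_command(deskewed, out_base)):
        # check=True stops the pipeline if either tool reports an error
        subprocess.run(cmd, check=True)
```

Calling `deskew_then_ocr("page.png", "page-deskewed.png", "page-ocr")` would then run OCR only on the straightened image, provided both executables are on the PATH.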
Upvotes: 1