Reputation: 18712
I have the following screenshot:
I want to extract the manuscript word count (3.574 in this case) from that image (see the red rectangle below).
To do this, I run the following script:
magick screenshot.png -crop 33x20+2+83 screenshot-cropped.png
tesseract screenshot-cropped.png screenshot-ocred -l eng
The first line crops the region containing the word count and saves it as screenshot-cropped.png, which looks like this:
The second line, tesseract screenshot-cropped.png screenshot-ocred -l eng, is supposed to recognize the characters and save them as text in screenshot-ocred.txt.
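For readers unfamiliar with the crop argument, here is a minimal sketch in Python of how the geometry string 33x20+2+83 breaks down into width, height, and x/y offset. The function name parse_crop_geometry is hypothetical and only handles the simple WxH+X+Y form used in this question, not every geometry variant ImageMagick accepts.

```python
import re


def parse_crop_geometry(geometry: str) -> dict:
    """Split a simple WxH+X+Y geometry string into its four integers.

    Hypothetical helper for illustration; it does not cover percentages
    or the other geometry flags that ImageMagick supports.
    """
    match = re.fullmatch(r"(\d+)x(\d+)([+-]\d+)([+-]\d+)", geometry)
    if match is None:
        raise ValueError(f"unsupported geometry: {geometry!r}")
    width, height, x_offset, y_offset = (int(group) for group in match.groups())
    return {"width": width, "height": height, "x": x_offset, "y": y_offset}


# The crop used in the question: a 33x20 pixel box whose top-left
# corner sits 2 pixels from the left and 83 pixels from the top.
print(parse_crop_geometry("33x20+2+83"))
```

The resulting image is only 33 by 20 pixels and contains a single short line of text.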
However, it produces the following error:
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>ocr.bat
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>magick screenshot.png -crop 33x20+2+83 screenshot-cropped.png
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>tesseract screenshot-cropped.png screenshot-ocred -l eng
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Empty page!!
Empty page!!
How can I fix this, i.e. make Tesseract recognize 3.574 and save it in screenshot-ocred.txt?
Note: All of this runs on Windows. Here is the output of magick --version:
C:\usr\dp\ref\marcomm\2020_04_22_wordCounter>magick --version
Version: ImageMagick 7.0.10-7 Q16 x64 2020-04-20 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2018 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Visual C++: 180040629
Features: Cipher DPC Modules OpenCL OpenMP(2.0)
Delegates (built-in): bzlib cairo flif freetype gslib heic jng jp2 jpeg lcms lqr lzma openexr pangocairo png ps raw rsvg tiff webp xml zlib
Upvotes: 0
Views: 1284
Reputation: 18712
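Adding --psm 7 to the Tesseract call solved the problem. This page segmentation mode tells Tesseract to treat the image as a single line of text rather than running its default automatic page segmentation. The full call is now:
tesseract screenshot-cropped.png screenshot-ocred -l eng --psm 7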
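If you would rather drive both steps from a script than from a batch file, the following is a rough sketch in Python. The function name build_commands and its parameters are my own invention for illustration; the command-line arguments themselves are exactly the ones from the question plus --psm 7. The block only prints the commands as a dry run; to execute them, pass each list to subprocess.run(cmd, check=True), assuming magick and tesseract are on the PATH.

```python
def build_commands(source, cropped, out_base, geometry="33x20+2+83", psm=7):
    """Return the crop and OCR commands as argument lists.

    Hypothetical helper: the names and defaults are assumptions taken
    from the file names and values used in the question.
    """
    # Step 1: cut the word-count region out of the screenshot.
    crop_cmd = ["magick", source, "-crop", geometry, cropped]
    # Step 2: OCR the cropped image as a single text line (--psm 7).
    # Tesseract appends ".txt" to out_base for the output file.
    ocr_cmd = ["tesseract", cropped, out_base, "-l", "eng", "--psm", str(psm)]
    return crop_cmd, ocr_cmd


# Dry run: show the two commands without executing them.
for cmd in build_commands("screenshot.png", "screenshot-cropped.png", "screenshot-ocred"):
    print(" ".join(cmd))
```

Building the commands as lists rather than one shell string avoids quoting problems if a file name ever contains spaces.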
Upvotes: 1