Reputation: 87
I am extracting text from some images with OCR. Some of them give me problems, for example this type of image:
library(magick)
library(tesseract)
# The file name must be a quoted string; a bare fichero.jpg is looked up as an R object
image_read("fichero.jpg") %>%
tesseract::ocr(engine = tesseract("eng")) %>%
cat()
I am assuming (correct me if not) that tesseract fails because of the low quality of the image (it is a scanned document), and I don't know if there is a way to improve the image.
I also tried some convolution methods with several kernels to reduce the noise in the photo, but the result was worse.
Is there a way to handle this, or do I have to assume that it is not possible to get the text from images of this quality?
Regards
Upvotes: 0
Views: 247
Reputation: 21647
It looks like you are trying to create a cow from ground beef. The big problem is that JPEG is not suited for this type of non-photographic image: its lossy compression smears the sharp edges of text. Your PNG looks fine because it is a lossless format.
If you don't want this problem, do not save the files as JPEG.
Upvotes: 0
Reputation: 355
Looking at this with the experience of a photographer rather than as a programmer, I would guess that the poor focus and camera jiggle make this image pretty well unreadable by most OCR options. I just used the OCR in Adobe Acrobat to play with it on my own PC and I could get it to recognize "FECHA", but not "NUMERO" and not any of the numbers.
I pulled it into a photo editor and messed around with the contrast, as sometimes it's possible to convert a grayscale image such as this to pure black-and-white and get rid of some of the fuzziness, but I couldn't produce a readable image in my quick-and-dirty experiment.
So realistically, you'll need images that are scanned/photographed with higher resolution and better contrast to get reliable OCR.
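If you want to try the contrast and black-and-white idea in R rather than in a photo editor, here is a minimal sketch using magick. The helper name `preprocess_for_ocr`, the target width of 2000 pixels, and the 50% threshold are assumptions for illustration, not tuned values, and none of this can restore detail that blur or JPEG compression has already destroyed.

```r
library(magick)

# Hypothetical helper: the name and the default values are illustrative guesses.
preprocess_for_ocr <- function(img, width = 2000, threshold = "50%") {
  img <- image_convert(img, type = "Grayscale")   # drop colour information
  img <- image_resize(img, paste0(width, "x"))    # scale to the target width, keeping aspect ratio
  img <- image_normalize(img)                     # stretch contrast over the full range
  # Binarize: push pixels above the threshold to white, then those below it to black
  img <- image_threshold(img, type = "white", threshold = threshold)
  img <- image_threshold(img, type = "black", threshold = threshold)
  img
}

path <- "fichero.jpg"  # file name taken from the question
if (file.exists(path)) {
  cleaned <- preprocess_for_ocr(image_read(path))
  cat(tesseract::ocr(cleaned, engine = tesseract::tesseract("eng")))
}
```

It may be worth trying a few threshold values (say 40% to 60%) and comparing the OCR output, since the best cut-off depends on how dark the scanned ink is. If the result is still unreadable, that supports the conclusion above that the source image itself is the limit.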
Upvotes: 0