GonzaloReig

Reputation: 87

How to reduce the noise of an image?

I am extracting text from some images with OCR. Some of them give me problems, for example this type of image:

library(magick)
library(tesseract)
image_read("fichero.jpg") %>%
  tesseract::ocr(engine = tesseract("eng")) %>%
  cat()

Result

I am assuming (correct me if I am wrong) that tesseract fails because of the low quality of the image (it is a scanned document), and I don't know if there is a way to improve the image.

I also tried some convolution methods with several kernels to reduce the noise in the image, but the result was worse.

Is there a way to handle this, or do I have to assume that it is not possible to extract the text from images of this quality?

Regards

Upvotes: 0

Views: 247

Answers (2)

user3344003

Reputation: 21647

It looks like you are trying to create a cow from ground beef. The big problem is that JPEG is not suited for this type of non-photographic image: its lossy compression smears the sharp edges of text. Your PNG looks fine because PNG is a lossless format.

If you don't want this problem, do not save the files as JPEG.
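To illustrate the difference, here is a minimal sketch in R with magick, assuming the package is installed. It uses ImageMagick's built-in "logo:" image only as a stand-in for a scan, and the temporary file paths and the JPEG quality of 30 are arbitrary choices for the demonstration. It writes the same image as PNG and as JPEG, reads both back, and counts how many pixel values changed. Note that re-saving an existing JPEG as PNG will not bring back detail that is already lost; the point is to avoid JPEG when the scan is first saved.

```r
library(magick)

# Built-in ImageMagick test image, used only as a stand-in for a scan.
img <- image_read("logo:")

png_path <- tempfile(fileext = ".png")
jpg_path <- tempfile(fileext = ".jpg")

# Write the same image once losslessly and once with lossy compression.
image_write(img, path = png_path, format = "png")
image_write(img, path = jpg_path, format = "jpeg", quality = 30)

# Pixel values of the original and of each round-tripped copy.
original <- as.integer(image_data(img, channels = "rgb"))
from_png <- as.integer(image_data(image_read(png_path), channels = "rgb"))
from_jpg <- as.integer(image_data(image_read(jpg_path), channels = "rgb"))

# Number of pixel values that differ from the original.
png_changed <- sum(original != from_png)
jpg_changed <- sum(original != from_jpg)

cat("Values changed by PNG: ", png_changed, "\n")
cat("Values changed by JPEG:", jpg_changed, "\n")
```

The PNG copy should come back unchanged, while the JPEG copy should differ in many places, mostly around sharp edges such as text.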

Upvotes: 0

DanM

Reputation: 355

Looking at this with the experience of a photographer rather than as a programmer, I would guess that the poor focus and camera jiggle make this image pretty well unreadable by most OCR options. I just used the OCR in Adobe Acrobat to play with it on my own PC and I could get "FECHA" to recognize, but not "NUMERO" and not any of the numbers.

I pulled it into a photo editor and messed around with the contrast, as sometimes it's possible to convert a grayscale image such as this to pure black-and-white and get rid of some of the fuzziness, but I couldn't produce a readable image in my quick-and-dirty experiment.

So realistically, you'll need images that are scanned/photographed with higher resolution and better contrast to get reliable OCR.
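For anyone who wants to try the same contrast and black-and-white experiment in R rather than in a photo editor, here is a rough sketch using magick. The helper name preprocess_for_ocr and its default values (200% upscaling, 50% threshold) are my own illustrative assumptions, not recommended settings, and as said above this may well not rescue an image that is out of focus.

```r
library(magick)

# Hypothetical helper: the name and the default values are illustrative only.
preprocess_for_ocr <- function(img, scale = "200%", threshold = "50%") {
  img <- image_convert(img, type = "Grayscale")  # drop colour information
  img <- image_resize(img, scale)                # enlarge small text
  img <- image_normalize(img)                    # stretch the contrast range
  # Push everything above the threshold to white and everything below it
  # to black, which leaves a pure black-and-white image.
  img <- image_threshold(img, type = "white", threshold = threshold)
  img <- image_threshold(img, type = "black", threshold = threshold)
  img
}

# Usage with the file from the question; skipped if the file is not present.
if (file.exists("fichero.jpg")) {
  cleaned <- preprocess_for_ocr(image_read("fichero.jpg"))
  text <- tesseract::ocr(cleaned, engine = tesseract::tesseract("eng"))
  cat(text)
}
```

It is worth trying a few threshold values (for example "40%" to "60%") and comparing the OCR output, since the best cut-off depends on how dark the ink is in the scan.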

Upvotes: 0
