Read text from PDF

Question

I am using pdftools in R to extract text from PDFs, but I am having several problems getting the information.

With this PDF, for example, when I try to get the text:

library(pdftools)
pdf_text(file.path(ruta, "Factura.pdf"))  # file.path() joins with "/"; paste() would insert a space

(ruta is the folder where the PDF is stored.) With this file I don't get anything. This step works with PDFs that are clean (like this one), but when the PDF contains scanned content, the extraction loses accuracy.

Is there any other way to get text from a PDF with R that solves this type of problem?

Thanks

Diya Li · Accepted Answer

The problem is that your example is an image PDF: a scanned image that is just stored as a PDF, with no text layer for pdf_text() to read.

If you want to extract text from an image PDF, you can use Tesseract, an OCR engine:

library(tesseract)
eng <- tesseract("eng")
text <- tesseract::ocr("http://jeroen.github.io/images/testocr.png", engine = eng)
cat(text)

Also, you need to convert the PDF to an image first. Check this answer:

library(animation)  # im.convert() calls ImageMagick, which must be installed
im.convert("bm.pdf", output = "bm.png")
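
The two steps above (convert to an image, then OCR) can be combined into one function. What follows is a minimal sketch, not a definitive implementation: extract_pdf_text() and needs_ocr() are hypothetical helper names, the 300 dpi default is an assumption, and it renders pages with pdftools::pdf_convert() instead of im.convert() so that ImageMagick is not needed. It assumes the pdftools and tesseract packages are installed, along with the Tesseract "eng" language data.

```r
# Sketch only: needs_ocr() and extract_pdf_text() are hypothetical helpers.
# Assumes the pdftools and tesseract packages are installed.

# TRUE for pages where pdf_text() found no embedded text layer.
needs_ocr <- function(page_text) !nzchar(trimws(page_text))

extract_pdf_text <- function(pdf_file, dpi = 300, language = "eng") {
  # One string per page; empty for scanned (image-only) pages.
  text <- pdftools::pdf_text(pdf_file)
  scanned <- which(needs_ocr(text))

  if (length(scanned) > 0) {
    # Render only the scanned pages to temporary PNG files.
    png_files <- pdftools::pdf_convert(
      pdf_file,
      format = "png",
      pages = scanned,
      dpi = dpi,
      filenames = file.path(tempdir(), sprintf("page_%d.png", scanned))
    )
    on.exit(unlink(png_files), add = TRUE)

    # Run OCR on each rendered page and fill in the missing text.
    engine <- tesseract::tesseract(language)
    text[scanned] <- vapply(
      png_files,
      function(f) tesseract::ocr(f, engine = engine),
      character(1),
      USE.NAMES = FALSE
    )
  }
  text
}
```

With the file from the question, this would be called as extract_pdf_text(file.path(ruta, "Factura.pdf")). Pages that already have a text layer are left to pdf_text(), which is faster and more accurate than OCR. Recent versions of pdftools also provide pdf_ocr_text(), which wraps a similar render-then-OCR step for whole documents.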
