GonzaloReig

Reputation: 87

Read text from PDF

I am using pdftools in R to extract text from PDFs, but I am having several problems getting the information.

For example, with this PDF, when I try to get the text:

library(pdftools)
# file.path() joins directory and file name; paste() would insert a space between them
pdf_text(file.path(ruta, "Factura.pdf"))

(ruta is the directory that contains the PDF.) With this file I don't get anything. This step works for PDFs that have a clean text layer (like this one), but accuracy is lost when the PDF contains scanned content.

Is there any other way to get text from a PDF with R that solves this type of problem?

Thanks

Upvotes: 0

Views: 149

Answers (1)

Diya Li

Reputation: 1088

The problem is that your example is an image PDF: a scanned image that is simply stored as a PDF, so it has no text layer for pdf_text to read.

If you want to extract text from an image PDF, you can use Tesseract (OCR):

library(tesseract)
eng <- tesseract("eng")
text <- tesseract::ocr("http://jeroen.github.io/images/testocr.png", engine = eng)
cat(text)

Note that you need to convert the PDF to an image first. Check this answer:

library(animation)  # provides im.convert(), which calls ImageMagick (must be installed)
im.convert("bm.pdf", output = "bm.png")
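
If you would rather not depend on ImageMagick, the two steps (render the pages to images, then OCR them) can probably be done with pdftools and tesseract alone. This is only a minimal sketch, not tested against your file: ocr_pdf is a hypothetical helper name, and the 300 dpi default and the "eng" language are assumptions you will likely want to tune for your scans.

```r
# Sketch: OCR every page of a scanned PDF.
# Assumes the pdftools and tesseract packages are installed.
# ocr_pdf is a hypothetical helper, not part of either package.
ocr_pdf <- function(pdf_path, dpi = 300, language = "eng") {
  # One temporary PNG file per page
  n_pages <- pdftools::pdf_info(pdf_path)$pages
  png_files <- file.path(tempdir(), sprintf("page_%03d.png", seq_len(n_pages)))
  on.exit(unlink(png_files), add = TRUE)  # remove the images when done

  # Step 1: render the PDF pages to PNG images
  pdftools::pdf_convert(pdf_path, format = "png", dpi = dpi,
                        filenames = png_files, verbose = FALSE)

  # Step 2: run Tesseract on each image; returns one string per page
  engine <- tesseract::tesseract(language)
  vapply(png_files, function(f) tesseract::ocr(f, engine = engine),
         character(1), USE.NAMES = FALSE)
}

# Usage with the file from the question (ruta is the asker's directory):
# text <- ocr_pdf(file.path(ruta, "Factura.pdf"))
# cat(text, sep = "\n")
```

Higher dpi values usually improve recognition at the cost of speed. Recent versions of pdftools also provide pdf_ocr_text(), which wraps the same render-then-OCR steps in a single call.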

Upvotes: 1
