Reputation: 87
I am using pdftools in R to extract text from PDF files, but I am having several problems getting the information.
With this PDF, for example, when I try to get the text:
library(pdftools)
pdf_text(file.path(ruta, "Factura.pdf"))  # file.path() joins with "/"; paste() would insert a space
(ruta is the folder that contains the PDF.) With this file I don't get anything. This step works with PDFs that have a proper text layer (like this one), but it loses accuracy when the PDF contains scanned content.
Is there any other way to get text from a PDF with R that solves this type of problem?
Thanks
Upvotes: 0
Views: 149
Reputation: 1088
The problem is that your example is an image PDF: a scanned image simply stored as a PDF, with no text layer for pdf_text to read.
If you want to extract text from an image PDF, you can use Tesseract for OCR:
library(tesseract)
eng <- tesseract("eng")
text <- tesseract::ocr("http://jeroen.github.io/images/testocr.png", engine = eng)
cat(text)
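You also need to convert the PDF to an image first (check this answer), for example with im.convert from the animation package, which calls ImageMagick:
library(animation)
im.convert("bm.pdf", output = "bm.png")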
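Putting the two steps together, here is a minimal sketch of how the whole thing could look using only pdftools and tesseract. It renders the pages with pdftools::pdf_convert instead of im.convert (a swap on my part, so ImageMagick is not needed). The helper name ocr_pdf, the 300 dpi default, and the "use the text layer if there is one" check are my own assumptions, not something from your setup:

```r
library(pdftools)
library(tesseract)

# Hypothetical helper (name and defaults are illustrative):
# returns one character string per page of the PDF.
ocr_pdf <- function(pdf_path, dpi = 300, language = "eng") {
  # If the PDF already has a text layer, use it and skip OCR.
  text <- pdf_text(pdf_path)
  if (any(nzchar(trimws(text)))) {
    return(text)
  }

  # Otherwise render every page to a temporary PNG file.
  n_pages <- pdf_info(pdf_path)$pages
  png_files <- file.path(tempdir(), sprintf("page_%03d.png", seq_len(n_pages)))
  pdf_convert(pdf_path, format = "png", dpi = dpi,
              filenames = png_files, verbose = FALSE)
  on.exit(unlink(png_files), add = TRUE)

  # Run Tesseract on each rendered page.
  engine <- tesseract(language)
  vapply(png_files, function(f) ocr(f, engine = engine),
         character(1), USE.NAMES = FALSE)
}
```

You would then call something like text <- ocr_pdf(file.path(ruta, "Factura.pdf")) and cat(text, sep = "\n"). For an invoice in another language, the matching Tesseract language data probably has to be installed first (for example with tesseract_download("spa")) and passed as language. A higher dpi usually helps recognition but makes it slower.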
Upvotes: 1