Reputation: 639
For a small project I am trying to read some data from scanned PDF files that do not contain the data.
Following the instructions of the Tesseract package, the code below should work. Unfortunately it triggers an error.
Error in tiff::writeTIFF(bitmap, "page.tiff") : INTEGER() can only be applied to a 'integer', not a 'raw'
Any clue on how this can be resolved?
library(pdftools)
library(tiff)
library(tesseract)
# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to jpeg/tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")
# Extract text from images
out <- ocr("page.tiff")
cat(out)
Upvotes: 1
Views: 862
Reputation: 819
Perhaps using pdf_convert()
instead of pdf_render_page()
, i.e.:
library(pdftools)
# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]
# Render pdf to jpeg/tiff image
pdf_convert(news, format = "tiff")
This generates multiple tiffs in the directory so you should add a code that reads and processes all of them one by one.
Upvotes: 1