OCR on PDF with Tesseract in R, writing TIFF - error

Question

For a small project I am trying to read some data from scanned PDF files that do not contain the text itself, only page images.

According to the tesseract package documentation, the code below should work. Unfortunately, it triggers an error:

Error in tiff::writeTIFF(bitmap, "page.tiff") : INTEGER() can only be applied to a 'integer', not a 'raw'

Any clue on how this can be resolved?

library(pdftools)
library(tiff)
library(tesseract)

# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]

# Render pdf to jpeg/tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")

# Extract text from images
out <- ocr("page.tiff")
cat(out)

Adi Sarid · Accepted Answer

Try using pdf_convert() instead of pdf_render_page(). It writes the image files directly, so writeTIFF() is not needed:

library(pdftools)

# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]

# Render each PDF page to a TIFF file (one file per page)
pdf_convert(news, format = "tiff")

This generates one TIFF file per page in the working directory, so you should add code that reads and processes each of them in turn.
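The per-page loop could look roughly like the sketch below. It relies on pdf_convert() returning the names of the files it wrote, and it assumes the default English tesseract engine. The helper name ocr_pdf_pages and the dpi = 300 default are my own choices for illustration, not part of either package.

```r
library(pdftools)
library(tesseract)

# Hypothetical helper (not part of pdftools or tesseract): render every page
# of a PDF to a TIFF file, then run OCR on each file in turn.
ocr_pdf_pages <- function(pdf_file, dpi = 300) {
  # pdf_convert() writes one image per page into the working directory
  # and returns the file names it created
  tiff_files <- pdf_convert(pdf_file, format = "tiff", dpi = dpi)

  # ocr() returns the recognised text of one image as a single string
  vapply(tiff_files, function(f) ocr(f), character(1), USE.NAMES = FALSE)
}

# Usage with the same example file as in the question, if it is available
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
if (file.exists(news)) {
  pages <- ocr_pdf_pages(news)
  cat(pages[1])
}
```

The result is a character vector with one element per page. If you would rather keep the original pdf_render_page() approach, as far as I know it also accepts numeric = TRUE, which returns real values instead of the raw bytes that writeTIFF() complains about.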
