Lod

Reputation: 639

OCR on PDF with Tesseract in R, writing TIFF - error

For a small project I am trying to read some data from scanned PDF files, which contain only page images and no extractable text.

According to the instructions for the tesseract package, the code below should work. Unfortunately, it triggers the following error:

Error in tiff::writeTIFF(bitmap, "page.tiff") : INTEGER() can only be applied to a 'integer', not a 'raw'

Any clue on how this can be resolved?

library(pdftools)
library(tiff)
library(tesseract)

# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]

# Render pdf to jpeg/tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")

# Extract text from images
out <- ocr("page.tiff")
cat(out)

Upvotes: 1

Views: 862

Answers (1)

Adi Sarid

Reputation: 819

Perhaps try using pdf_convert() instead of pdf_render_page(), for example:

library(pdftools)

# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]

# Render pdf to jpeg/tiff image
pdf_convert(news, format = "tiff")

This writes one TIFF file per page to the working directory, so you will need to add code that reads and processes them one by one.
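
A minimal sketch of that loop, assuming pdf_convert() returns the names of the files it wrote and that ocr() is given one file at a time. The `pages = 1:2` limit and the variable names `tiff_files` and `page_text` are only illustrative choices, not part of the original code:

```r
library(pdftools)
library(tesseract)

# Same sample PDF as above; any multi-page PDF path should work here
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")

# Write one TIFF per page and keep the returned file names.
# pages = 1:2 is an illustrative limit to keep the run short.
tiff_files <- pdf_convert(news, format = "tiff", pages = 1:2, dpi = 300)

# Run OCR on each page image in turn; one string per page
page_text <- vapply(tiff_files, ocr, character(1))

# Print the recognised text of the first page
cat(page_text[1])
```

As for the error in the question: writeTIFF() expects a numeric array, while pdf_render_page() returns a raw bitmap by default, so passing numeric = TRUE to pdf_render_page() may also resolve it.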

Upvotes: 1
