Reputation: 1443
I'm trying to convert a series of scanned PDFs into searchable PDFs using the tesseract and pdftools packages. I've completed the first two steps, shown below. Now I need to write the OCR results back to a searchable PDF.
eg <- download.file("https://www.fujitsu.com/global/Images/sv600_c_automatic.pdf", "example.pdf", mode = "wb")
results <- tesseract::ocr_data("example.pdf", engine = "eng")
R> results
# A tibble: 406 x 3
word confidence bbox
<chr> <dbl> <chr>
1 PFU 96.9 228,181,404,249
2 Business 96.2 459,180,847,249
3 report 96.2 895,182,1145,259
4 | 52.5 3980,215,3984,222
5 No.068 91.0 4439,163,4754,237
6 New 96.0 493,503,1005,687
7 customer's 94.6 1069,484,2231,683
8 development 96.5 2304,483,3714,732
9 di 90.4 767,763,1009,959
10 ing 96.3 1754,773,1786,807
# ... with 396 more rows
Alternatively, is there another package or command-line tool I can invoke in R for Windows?
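For reference, the bbox strings in the output above can be split into numeric columns before any text is positioned on a page. This is a minimal base R sketch, assuming the four values are x1,y1,x2,y2 in pixels (an assumption about the column layout, not something verified here). The helper name parse_bbox is hypothetical and the sample rows are copied from the output above.

```r
# Sample rows copied from the printed results; in practice use the real `results`.
results <- data.frame(
  word = c("PFU", "Business", "report"),
  confidence = c(96.9, 96.2, 96.2),
  bbox = c("228,181,404,249", "459,180,847,249", "895,182,1145,259"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "a,b,c,d" strings into four integer columns.
parse_bbox <- function(bbox) {
  parts <- strsplit(bbox, ",", fixed = TRUE)
  coords <- do.call(rbind, lapply(parts, as.integer))
  colnames(coords) <- c("x1", "y1", "x2", "y2")  # assumed coordinate order
  as.data.frame(coords)
}

boxes <- cbind(results[c("word", "confidence")], parse_bbox(results$bbox))
boxes$width  <- boxes$x2 - boxes$x1
boxes$height <- boxes$y2 - boxes$y1
print(boxes)
```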
Upvotes: 3
Views: 1074
Reputation: 2243
If you have eCopy PDF Pro Office installed on your computer (commercial software, not free), you can use the following function to convert scanned PDFs to searchable PDFs:
ecopy_Scanned_PDF_To_Numeric_PDF <- function(directory_Scanned_PDF, directory_Numeric_PDF)
{
  path_To_BatchConverter <- "C:/Program Files (x86)/Nuance/eCopy PDF Pro Office 6/BatchConverter.com"
  args <- paste0("-I", directory_Scanned_PDF, "\\*.pdf -O", directory_Numeric_PDF, " -Tpdfs -Lfre -W -V1.5 -J -Ao")
  system2(path_To_BatchConverter, args = args)
}
I use this function at work and it works very well.
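Before running the converter on a real folder, it can help to look at the argument string that will be passed to it. The sketch below builds the same string as the function above without calling system2. The helper name build_ecopy_args and the two example directories are hypothetical; the flags are copied unchanged from the function above.

```r
# Hypothetical helper: same paste0() call as above, but it only returns the string.
build_ecopy_args <- function(directory_Scanned_PDF, directory_Numeric_PDF) {
  paste0("-I", directory_Scanned_PDF, "\\*.pdf -O", directory_Numeric_PDF,
         " -Tpdfs -Lfre -W -V1.5 -J -Ao")
}

# Example directories, not from the original answer.
ecopy_args <- build_ecopy_args("C:\\scans", "C:\\searchable")
cat(ecopy_args, "\n")
```

Note that the string is built by plain concatenation, so directories containing spaces would likely need extra quoting.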
Upvotes: 1
Reputation: 2243
Here is one approach based on the RDCOMClient R package. We first open the PDF in Word, which applies its built-in OCR during the conversion, and save it as a Word document. We then use Word to save that document as a searchable PDF.
library(RDCOMClient)
path_PDF <- "C:/example.pdf"
path_Word <- "C:/example.docx"
# Download straight to path_PDF so that Word opens the file that was just fetched
download.file("https://www.fujitsu.com/global/Images/sv600_c_automatic.pdf", path_PDF, mode = "wb")
################################################################
#### Step 1 : Convert PDF to word document with OCR of Word ####
################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
doc_Selection <- wordApp$Selection()
##########################################################
#### Step 2 : Convert word document to searchable pdf ####
##########################################################
path_PDF_Searchable <- "C:/example_searchable.pdf"
wordApp[["ActiveDocument"]]$SaveAs(path_PDF_Searchable, FileFormat = 17) # FileFormat = 17 saves as .PDF
doc$Close()
wordApp$Quit() # quit wordApp
Upvotes: 1
Reputation: 859
I had a similar need and wrote a simple function in R to call the command line for OCRmyPDF.
I'm using Ubuntu, so first install OCRmyPDF in Ubuntu via:
sudo apt install ocrmypdf
Here's the info for installing it on other operating systems.
Then define the function in R by running:
ocr_my_pdf <- function(path_read, ..., path_save = NULL){
  path_read <- here::here(path_read)
  if(is.null(path_save)){
    path_save <- stringr::str_replace(path_read, '(?i)\\.pdf$','_ocr.pdf')
  } else {
    path_save <- here::here(path_save)
  }
  sys_args <- c(
    glue::glue("'{unlist(list(...))}'"),
    glue::glue("'{path_read}'"),
    glue::glue("'{path_save}'"))
  system2('ocrmypdf', args = sys_args)
}
Then call the function on a test PDF with:
ocr_my_pdf('/home/test.pdf')
Or, with whatever additional arguments you want to pass:
ocr_my_pdf('test.pdf', '--deskew', '--clean', '--rotate-pages')
Here's the info for available arguments.
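If a file name might contain a single quote, wrapping it in '...' by hand can break the command. Here is a rough base R variant of the argument-building step that uses shQuote() instead. It is only a sketch: build_ocrmypdf_args is a hypothetical helper, it skips the here::here() path resolution used above, and it returns the argument vector without running ocrmypdf.

```r
# Hypothetical helper: assemble the arguments for system2() without running anything.
build_ocrmypdf_args <- function(path_read, ..., path_save = NULL) {
  if (is.null(path_save)) {
    # Same default as above: "file.pdf" -> "file_ocr.pdf" (case-insensitive extension)
    path_save <- sub("\\.pdf$", "_ocr.pdf", path_read, ignore.case = TRUE)
  }
  extra <- as.character(unlist(list(...)))
  # shQuote(type = "sh") escapes each value for a Bourne-style shell,
  # including any single quotes inside a file name
  shQuote(c(extra, path_read, path_save), type = "sh")
}

sys_args <- build_ocrmypdf_args("scan's.pdf", "--deskew", "--clean")
print(sys_args)
# system2("ocrmypdf", args = sys_args)  # uncomment once ocrmypdf is installed
```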
Upvotes: 2