Reputation: 93
I have 100 scanned PDF files and I need to convert them into text files.
I first converted them into PNG files (see the script below). Now I need help converting these 100 PNG files into 100 text files.
library(pdftools)
library("tesseract")
#location
dest <- "P:\\TEST\\images to text"
#making loop for all files
myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)
#Convert files to png
sapply(myfiles, function(x)
  pdf_convert(x, format = "png", pages = NULL,
              filenames = NULL, dpi = 600, opw = "", upw = "", verbose = TRUE))
#read files
cat(text)
I expect to have a text file for each png file:
From: file1.png, file2.png, file3.png...
To: file1.txt, file2.txt, file3.txt...
But the actual result is a single text file containing the text of all the PNG files.
Upvotes: 0
Views: 842
Reputation: 20399
I guess you left out the png -> text step, but I assume you used library(tesseract).
You could do the following in your code:
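library(pdftools)
library(tesseract)
## create the OCR engine once and reuse it for every file
eng <- tesseract("eng")
sapply(myfiles, function(x) {
  ## derive the output names from the PDF name: file1.pdf -> file1.png, file1.txt
  png_file <- sub("\\.pdf$", ".png", x)
  txt_file <- sub("\\.pdf$", ".txt", x)
  ## pages = 1 converts only the first page of each PDF
  pdf_convert(x, format = "png", pages = 1,
              filenames = png_file, dpi = 600, verbose = TRUE)
  text <- ocr(png_file, engine = eng)
  cat(text, file = txt_file)
  ## just return the text string for convenience
  ## we are anyway more interested in the side effects
  text
})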
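If the PNG files already exist, as described in the question, the conversion step can probably be skipped and each image run through OCR on its own. The following is only a sketch under a few assumptions: the PNGs sit in the `dest` folder from the question (adjust the path if `pdf_convert` wrote them somewhere else, such as the working directory), the text is English, and the helper names `txt_path_for` and `ocr_pngs_to_txt` are made up for illustration and are not part of any package.

```r
# Hypothetical helper: map "file1.png" to "file1.txt" in the same folder.
txt_path_for <- function(png_file) {
  sub("\\.png$", ".txt", png_file, ignore.case = TRUE)
}

# Hypothetical helper: OCR every PNG separately and write one text file per image.
# The default engine is only created when the first image is processed.
ocr_pngs_to_txt <- function(png_files, engine = tesseract::tesseract("eng")) {
  for (png_file in png_files) {
    text <- tesseract::ocr(png_file, engine = engine)
    cat(text, file = txt_path_for(png_file))
  }
  # Return the paths of the text files without printing them.
  invisible(txt_path_for(png_files))
}

# Assumption: the PNG files are located in the folder used in the question.
dest <- "P:\\TEST\\images to text"
png_files <- list.files(path = dest, pattern = "\\.png$", full.names = TRUE)
ocr_pngs_to_txt(png_files)
```

Because `cat()` is called with a different `file` for every image, the text of each PNG ends up in its own file instead of being collected in one place.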
Upvotes: 2