Austin

Reputation: 153

Converting pdf files into data.frames

I'm trying to write a function that reads many PDF files into a data frame. My ultimate goal is to extract specific information from each PDF and build a data.frame with one insurance plan per row and columns for the information I need, such as the individual plan price, the family plan price, etc. I have been following an answer to a similar past question, but I keep getting an error. Here are links to the two files I am practicing on (1 and 2).

My code and the error are below:

PDFtoDF = function(file) {

  dat = readPDF(control=list(text="-layout"))(elem=list(uri=file), 
                                              language="en", id="id1") 
  dat = c(as.character(dat))

  dat = gsub("^ ?([0-9]{1,3}) ?", "\\1|", dat)

  dat = gsub("(, HVOL )","\\1 ", dat)
  dat = gsub(" {2,100}", "|", dat)

  excludeRows = lapply(gregexpr("\\|", dat), function(x) length(x)) != 6
  write(dat[excludeRows], "rowsToCheck.txt", append=TRUE)

  dat = dat[!excludeRows]

  dat = read.table(text=dat, sep="", quote="", stringsAsFactors=FALSE)
  names(dat) = c("Plan", "Individual", "Family")
  return(dat)
}

files <- list.files(pattern = "pdf$")

df = do.call("rbind", lapply(files, PDFtoDF))


    Error in read.table(text = dat, sep = "", quote = "", stringsAsFactors = FALSE) : 
      no lines available in input

Before this approach, I had been using the pdftools package and regular expressions. That worked, except that it was difficult to define a pattern for some parts of the document, such as the plan name at the top. I was hoping the method I'm trying now would help, since it extracts the text into separate strings for me.

Upvotes: 1

Views: 2288

Answers (1)

Ken Benoit

Reputation: 14902

Here's the best answer:

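library(readtext)
# Read every PDF in the working directory; the result is a data.frame
# with one row per file and the columns doc_id and text.
df <- readtext("*.pdf")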

Yes, it's that simple with the readtext package!
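Note that readtext only gets the raw text of each file into a data.frame; the plan name and prices still have to be pulled out of the text column. Here is a minimal sketch of that step in base R. It is not based on the actual files: the sample text, the "Individual" and "Family" labels, the assumption that the plan name is the first line, and the helper `extract_price` are all made up for illustration, so the patterns would need adjusting to the real layout.

```r
# Stand-in for the result of readtext("*.pdf"): one row per file, with the
# columns doc_id and text. The text below is invented for illustration.
docs <- data.frame(
  doc_id = c("plan_a.pdf", "plan_b.pdf"),
  text = c("Gold Plan\nIndividual: $250.00\nFamily: $600.50",
           "Silver Plan\nIndividual: $180.25\nFamily: $455.00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first dollar amount that follows `label`
# in each element of `text`, or NA when the label is not found.
extract_price <- function(text, label) {
  pattern <- paste0(label, "[^$]*\\$([0-9,]+\\.?[0-9]*)")
  matches <- regmatches(text, regexec(pattern, text))
  vapply(matches, function(m) {
    if (length(m) >= 2) as.numeric(gsub(",", "", m[2])) else NA_real_
  }, numeric(1))
}

# Assumption: the plan name is the first line of each document.
first_line <- vapply(strsplit(docs$text, "\n", fixed = TRUE),
                     function(lines) trimws(lines[1]), character(1))

plans <- data.frame(
  Plan       = first_line,
  Individual = extract_price(docs$text, "Individual"),
  Family     = extract_price(docs$text, "Family"),
  stringsAsFactors = FALSE
)

print(plans)
```

With real files, `docs` would be replaced by the output of `readtext("*.pdf")`. Returning NA for a missing label keeps one row per file, so documents that do not match the assumed layout are easy to spot afterwards.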

Upvotes: 2
