Reputation: 153
My company documents summaries of policies/services for each client in a PDF file. These files are combined into a large dataset each year: one row per client, with columns for the variables in that client's document. There are a couple thousand of these files, each with approximately 20-30 variables. I want to automate this process by creating a data.frame with one row per client and then pulling the variables for each client from their PDF document. I'm able to create a list or data.frame of all the clients from the PDF filenames in a directory, but I don't know how to write a loop that pulls each variable I need from each document. I currently have two different methods which I can't decide between, and I also need help with a loop that grabs the variables I need from each client document. My code and links to two mock files are provided below. Any help would be appreciated!
Method 1: pdftools
The benefit of the first method is that it extracts the entire PDF into a vector, with each page in a separate element. This makes it easier for me to pull strings/variables. However, I don't know how to loop over it to pull the information from each client document and place it in the appropriate column for that client.
library(pdftools)
library(stringr)

Files <- list.files(path = "...", pattern = "\\.pdf$")
text <- pdf_text(Files[1])  # one element per page
FR <- str_match(text, "\\$\\d+\\s+Financial Reporting")  # extract the first variable
Method 2:
The benefit of this approach is that it automatically creates a database of the client documents, with the file name as a row identifier and each entire PDF in a single text variable. The downside is that a whole PDF in one variable makes it harder to match and extract strings than having each page in its own element. I don't know how to write a loop that will extract the variables for each client and place them in their respective columns.
library(readtext)
library(dplyr)
library(stringr)

DF <- readtext("directory pathway/*.pdf")
DF <- DF %>% mutate(FR =
  str_match(text, "\\$\\d+\\s+Financial Reporting"))
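For what it's worth, if each value always follows a fixed label, Method 2 may not need an explicit loop at all: `str_match` is vectorised over the `text` column, so each `mutate` line fills that column for every client at once. A minimal sketch, where the `Audit` pattern is a hypothetical second variable invented for illustration:

```r
library(readtext)
library(dplyr)
library(stringr)

DF <- readtext("directory pathway/*.pdf")
DF <- DF %>%
  mutate(
    # column 2 of the str_match result is the captured (\\d+) group
    FR    = as.numeric(str_match(text, "\\$(\\d+)\\s+Financial Reporting")[, 2]),
    Audit = as.numeric(str_match(text, "\\$(\\d+)\\s+Audit")[, 2])  # hypothetical
  )
```

Each additional variable is just one more line in the `mutate` call.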
Upvotes: 2
Views: 93
Reputation: 1268
Here's a basic framework that I think solves your problem using your proposed Method 1.
library(pdftools)
library(stringr)
Files <- list.files(path = "pdfs/", pattern = "\\.pdf$")
lf <- length(Files)
client_df <- data.frame(client = rep(NA, lf), fr = rep(NA, lf))
for(i in 1:lf){
# extract the text from the pdf
f <- pdf_text(paste0("pdfs/", Files[i]))
# remove commas from numbers
f <- gsub(',', '', f)
# extract variables
client_name <- str_match(f[1], "Client\\s+\\d+")[[1]]
fr <- as.numeric(str_match(f[1], "\\$(\\d+)\\s+Financial Reporting")[[2]])
# add variables to your dataframe
client_df$client[i] <- client_name
client_df$fr[i] <- fr
}
I removed commas from the text under the assumption that you'll want to use any numeric variables you extract as numbers in some analysis. Note that this removes all commas, so if commas are meaningful elsewhere in the documents you'll have to rethink that step.
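If the blanket comma removal is too aggressive, one way to restrict it is to strip only commas that sit between two digits, leaving commas in ordinary prose untouched. A small sketch of that idea:

```r
s <- "Fees were $1,234,567 for reporting, audit, and review."

# Drop a comma only when a digit precedes it and (via lookahead) a digit follows it
s <- gsub("(\\d),(?=\\d)", "\\1", s, perl = TRUE)

s
# "Fees were $1234567 for reporting, audit, and review."
```

The lookahead keeps the trailing digit out of the match, so consecutive groups like `1,234,567` collapse correctly in a single pass.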
Also note that I put the sample PDFs into a directory called 'pdfs'.
I would imagine that with a little creative regex you can extract anything else that would be useful. This method makes it easy to scrape the data as long as the elements of interest are always on the same pages across all documents. (Note the index on f in the str_match lines.) Hope this helps!
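As a variation on the loop above, the same extraction can be written functionally: wrap the per-file work in a helper that returns a one-row data.frame, then bind the results together. This avoids pre-allocating columns and makes it easy to add variables later (the helper name `extract_client` is my own, not from the original code):

```r
library(pdftools)
library(stringr)

# Read one PDF and return a one-row data.frame of extracted variables
extract_client <- function(path) {
  f <- gsub(",", "", pdf_text(path))  # one element per page, commas stripped
  data.frame(
    client = str_match(f[1], "Client\\s+\\d+")[, 1],
    fr     = as.numeric(str_match(f[1], "\\$(\\d+)\\s+Financial Reporting")[, 2]),
    stringsAsFactors = FALSE
  )
}

Files <- list.files("pdfs/", pattern = "\\.pdf$", full.names = TRUE)
client_df <- do.call(rbind, lapply(Files, extract_client))
```

Note the `full.names = TRUE`, which makes `paste0` on the directory prefix unnecessary.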
Upvotes: 2