Niki

Reputation: 23

How to extract specific parts of messy PDFs in R?

I need to extract specific parts of a large corpus of PDF documents. The PDFs are large, messy reports containing all kinds of numeric, alphabetic and other information. The files are of different lengths but have unified content and sections across them. Each document has a Table of Contents listing the section names. For example:

Table of Contents:

Item 1. Business                                                                            1
Item 1A. Risk Factors                                                                       2
Item 1B. Unresolved Staff Comments                                                          5
Item 2. Properties                                                                          10
Item N........

..........text I do not care about...........

Item 1A. Risk Factors 

.....text I am interested in getting.......

(section ends)

Item 1B. Unresolved Staff Comments

..........text I do not care about...........

I have no problem reading them in and analyzing them as a whole, but I need to pull out only the text between "Item 1A. Risk Factors" and "Item 1B. Unresolved Staff Comments". I used the pdftools, tm, quanteda and readtext packages. This is the part of the code I use to read in my docs. I created a directory called "PDF" where I placed my PDFs, and another directory called "Texts" where R will place the files converted to ".txt".

pdf_directory <- paste0(getwd(), "/PDF")
txt_directory <- paste0(getwd(), "/Texts")
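One note on this setup: write() will fail later if the "Texts" folder does not exist yet, so it is worth creating it up front (a small sketch using the txt_directory path defined above):

```r
# write() fails later if the target folder is missing, so create it
# first; dir.create() only warns if the directory already exists
if (!dir.exists(txt_directory)) dir.create(txt_directory)
```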

Then I create a list of files using the list.files() function.

files <- list.files(pdf_directory, pattern = "\\.pdf$", recursive = FALSE,
                    full.names = TRUE)
files

After that, I go on to create a function that extracts the text and writes it to a ".txt" file named after the PDF.

extract <- function(filename) {
  print(filename)
  # pdf_text() can fail on a corrupt file; skip it instead of stopping
  text <- tryCatch(pdftools::pdf_text(filename), error = function(e) NULL)
  if (is.null(text)) return(invisible(NULL))
  # keep only the base name, with the ".pdf" extension (escaped dot) removed
  f <- gsub("(.*)/([^/]*)\\.pdf$", "\\2", filename)
  write(text, file.path(txt_directory, paste0(f, ".txt")))
}

for (file in files) {
  extract(file)
}

After this step I get stuck and do not know how to proceed. I am not sure whether I should try to extract the section of interest while reading the data in; if so, I suppose I would have to wrestle with the line in the function where the file name is built -- f <- gsub("(.*)/([^/]*).pdf", "\\2", filename)? I apologize for such questions, but I am teaching myself. I also tried the following code on just one file instead of the corpus:

start <- grep("^\\*\\*\\* ITEM 1A. RISK FACTORS", text_df$text) + 1
stop  <- grep("^ITEM 1B. UNRESOLVED STAFF COMMENTS", text_df$text) - 1

lines <- raw[start:stop]

scd <- paste0(".*", start, "(.*)", "\n", stop, ".*")
gsub(scd, "\\1", name_of_file)

but it did not help me in any way.
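For reference, the slicing idea itself does work once the lookup and the subsetting use the same vector and the literal dots in the patterns are escaped; a minimal sketch on made-up lines (the vector here stands in for one line of text per element, as text_df$text would hold):

```r
lines <- c("junk",
           "ITEM 1A. RISK FACTORS",
           "keep me",
           "ITEM 1B. UNRESOLVED STAFF COMMENTS",
           "more junk")

# find the line after the start marker and the line before the end marker
start <- grep("^ITEM 1A\\. RISK FACTORS", lines) + 1
stop  <- grep("^ITEM 1B\\. UNRESOLVED STAFF COMMENTS", lines) - 1

lines[start:stop]
#> [1] "keep me"
```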

Upvotes: 2

Views: 1575

Answers (1)

JBGruber

Reputation: 12420

I don't really see why you would write files into a txt first, so I did it all in one go.

What threw me off a little is that your patterns contain runs of extra spaces. You can match them with the regular expression \s+.
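For instance, str_detect() shows the pattern matching across an arbitrary run of spaces (the dot is escaped so it only matches a literal period):

```r
library(stringr)

# "\\s+" matches one or more whitespace characters, so the pattern
# works however many spaces the PDF layout inserted
str_detect("ITEM 1A.      RISK FACTORS", "ITEM 1A\\.\\s+RISK FACTORS")
#> [1] TRUE
```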

library(stringr)
files <- c("https://corporate.exxonmobil.com/-/media/Global/Files/investor-relations/investor-relations-publications-archive/ExxonMobil-2016-Form-10-K.pdf",
           "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")


relevant_l <- lapply(files, function(file) {
  
  # print status message
  message("processing: ", basename(file))
  
  lines <- unlist(stringr::str_split(pdftools::pdf_text(file), "\n"))
  start <- stringr::str_which(lines, "ITEM 1A\\.\\s+RISK FACTORS")
  end <- stringr::str_which(lines, "ITEM 1B\\.\\s+UNRESOLVED STAFF COMMENTS")
  
  # cover a few different outcomes depending on what was found
  if (length(start) == 1 && length(end) == 1) {
    relevant <- lines[start:end]
  } else if (length(start) == 0 || length(end) == 0) {
    relevant <- "Pattern not found"
  } else {
    relevant <- "Problems found"
  }
  
  return(relevant)
})
#> processing: ExxonMobil-2016-Form-10-K.pdf
#> processing: dummy.pdf

names(relevant_l) <- basename(files)
sapply(relevant_l, head)
#> $`ExxonMobil-2016-Form-10-K.pdf`
#> [1] "ITEM 1A.           RISK FACTORS\r"                                                                                                   
#> [2] "ExxonMobil’s financial and operating results are subject to a variety of risks inherent in the global oil, gas, and petrochemical\r" 
#> [3] "businesses. Many of these risk factors are not within the Company’s control and could adversely affect our business, our financial\r"
#> [4] "and operating results, or our financial condition. These risk factors include:\r"                                                    
#> [5] "Supply and Demand\r"                                                                                                                 
#> [6] "The oil, gas, and petrochemical businesses are fundamentally commodity businesses. This means ExxonMobil’s operations and\r"         
#> 
#> $dummy.pdf
#> [1] "Pattern not found"
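One small follow-up: the extracted lines above still end in a carriage return, because pdf_text() keeps the PDF's "\r\n" line endings and the split was on "\n" only. A quick cleanup sketch:

```r
library(stringr)

# after splitting on "\n", each line keeps a trailing "\r";
# strip it with an anchored pattern
lines <- c("ITEM 1A.           RISK FACTORS\r", "Supply and Demand\r")
str_remove(lines, "\r$")
```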

I would return the results as a list and then use the original file names to name the list elements. Let me know if you have questions. I use the stringr package since it is fast and consistent in dealing with strings, but the commands str_which and grep work pretty much the same.
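A quick illustration of that equivalence: both return the indices of the matching elements.

```r
x <- c("Item 1.  Business",
       "ITEM 1A. RISK FACTORS",
       "ITEM 1B. UNRESOLVED STAFF COMMENTS")

stringr::str_which(x, "RISK FACTORS")
#> [1] 2
grep("RISK FACTORS", x)
#> [1] 2
```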

Upvotes: 2
