Niki

Reputation: 23

How to extract specific parts of messy PDFs in R?

I need to extract specific parts of a large corpus of PDF documents. The PDFs are large, messy reports containing all kinds of numeric, alphabetic and other information. The files are of different lengths but have unified content and sections across them. Each document has a Table of Contents listing the section names. For example:

Table of Contents:

Item 1. Business                                                                            1
Item 1A. Risk Factors                                                                       2
Item 1B. Unresolved Staff Comments                                                          5
Item 2. Properties                                                                          10
Item N........

..........text I do not care about...........

Item 1A. Risk Factors 

.....text I am interested in getting.......

(section ends)

Item 1B. Unresolved Staff Comments

..........text I do not care about...........

I have no problem reading them in and analyzing them as a whole, but I need to pull out only the text between "Item 1A. Risk Factors" and "Item 1B. Unresolved Staff Comments". I used the pdftools, tm, quanteda and readtext packages. This is the part of the code I use to read in my docs. I created a directory called "PDF" where I placed my PDFs, and another directory called "Texts" where R will place the files converted to ".txt".

pdf_directory <- paste0(getwd(), "/PDF")
txt_directory <- paste0(getwd(), "/Texts")
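One note on this setup: write() will fail later if the "Texts" folder does not exist yet, so it is worth creating it up front (a small sketch using the txt_directory path defined above):

```r
# write() fails later if the target folder is missing, so create it
# first; dir.create() only warns if the directory already exists
if (!dir.exists(txt_directory)) dir.create(txt_directory)
```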

Then I create a list of files using the list.files() function.

files <- list.files(pdf_directory, pattern = "\\.pdf$", recursive = FALSE,
                    full.names = TRUE)
files

After that, I go on to create a function that extracts the text and writes it to a ".txt" file named after the PDF.

extract <- function(filename) {
  print(filename)
  # pdf_text() can fail on a corrupt file; skip it instead of stopping
  text <- tryCatch(pdftools::pdf_text(filename), error = function(e) NULL)
  if (is.null(text)) return(invisible(NULL))
  # keep only the base name, with the ".pdf" extension (escaped dot) removed
  f <- gsub("(.*)/([^/]*)\\.pdf$", "\\2", filename)
  write(text, file.path(txt_directory, paste0(f, ".txt")))
}

for (file in files) {
  extract(file)
}

After this step I get stuck and do not know how to proceed. I am not sure whether I should try to extract the section of interest while reading the data in; if so, I suppose I would have to wrestle with the line in the function where the file name is built -- f <- gsub("(.*)/([^/]*).pdf", "\\2", filename)? I apologize for such questions, but I am teaching myself. I also tried the following code on just one file instead of the corpus:

start <- grep("^\\*\\*\\* ITEM 1A. RISK FACTORS", text_df$text) + 1
stop  <- grep("^ITEM 1B. UNRESOLVED STAFF COMMENTS", text_df$text) - 1

lines <- raw[start:stop]

scd <- paste0(".*", start, "(.*)", "\n", stop, ".*")
gsub(scd, "\\1", name_of_file)

but it did not help me in any way.
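For reference, the slicing idea itself does work once the lookup and the subsetting use the same vector and the literal dots in the patterns are escaped; a minimal sketch on made-up lines (the vector here stands in for one line of text per element, as text_df$text would hold):

```r
lines <- c("junk",
           "ITEM 1A. RISK FACTORS",
           "keep me",
           "ITEM 1B. UNRESOLVED STAFF COMMENTS",
           "more junk")

# find the line after the start marker and the line before the end marker
start <- grep("^ITEM 1A\\. RISK FACTORS", lines) + 1
stop  <- grep("^ITEM 1B\\. UNRESOLVED STAFF COMMENTS", lines) - 1

lines[start:stop]
#> [1] "keep me"
```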

Upvotes: 2

Views: 1575

Answers (1)

JBGruber

Reputation: 12420

I don't really see why you would write files into a txt first, so I did it all in one go.

What threw me off a little is that your patterns contain runs of extra spaces. You can match them with the regular expression \s+.
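For instance, str_detect() shows the pattern matching across an arbitrary run of spaces (the dot is escaped so it only matches a literal period):

```r
library(stringr)

# "\\s+" matches one or more whitespace characters, so the pattern
# works however many spaces the PDF layout inserted
str_detect("ITEM 1A.      RISK FACTORS", "ITEM 1A\\.\\s+RISK FACTORS")
#> [1] TRUE
```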

library(stringr)
files <- c("https://corporate.exxonmobil.com/-/media/Global/Files/investor-relations/investor-relations-publications-archive/ExxonMobil-2016-Form-10-K.pdf",
           "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf")


relevant_l <- lapply(files, function(file) {
  
  # print status message
  message("processing: ", basename(file))
  
  lines <- unlist(stringr::str_split(pdftools::pdf_text(file), "\n"))
  start <- stringr::str_which(lines, "ITEM 1A\\.\\s+RISK FACTORS")
  end <- stringr::str_which(lines, "ITEM 1B\\.\\s+UNRESOLVED STAFF COMMENTS")
  
  # cover a few different outcomes depending on what was found
  if (length(start) == 1 && length(end) == 1) {
    relevant <- lines[start:end]
  } else if (length(start) == 0 || length(end) == 0) {
    relevant <- "Pattern not found"
  } else {
    relevant <- "Problems found"
  }
  
  return(relevant)
})
#> processing: ExxonMobil-2016-Form-10-K.pdf
#> processing: dummy.pdf

names(relevant_l) <- basename(files)
sapply(relevant_l, head)
#> $`ExxonMobil-2016-Form-10-K.pdf`
#> [1] "ITEM 1A.           RISK FACTORS\r"                                                                                                   
#> [2] "ExxonMobil’s financial and operating results are subject to a variety of risks inherent in the global oil, gas, and petrochemical\r" 
#> [3] "businesses. Many of these risk factors are not within the Company’s control and could adversely affect our business, our financial\r"
#> [4] "and operating results, or our financial condition. These risk factors include:\r"                                                    
#> [5] "Supply and Demand\r"                                                                                                                 
#> [6] "The oil, gas, and petrochemical businesses are fundamentally commodity businesses. This means ExxonMobil’s operations and\r"         
#> 
#> $dummy.pdf
#> [1] "Pattern not found"
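One small follow-up: the extracted lines above still end in a carriage return, because pdf_text() keeps the PDF's "\r\n" line endings and the split was on "\n" only. A quick cleanup sketch:

```r
library(stringr)

# after splitting on "\n", each line keeps a trailing "\r";
# strip it with an anchored pattern
lines <- c("ITEM 1A.           RISK FACTORS\r", "Supply and Demand\r")
str_remove(lines, "\r$")
```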

I would return the results as a list and then use the original file names to name the list elements. Let me know if you have questions. I use the stringr package since it is fast and consistent in dealing with strings, but the commands str_which and grep work pretty much the same.
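A quick illustration of that equivalence: both return the indices of the matching elements.

```r
x <- c("Item 1.  Business",
       "ITEM 1A. RISK FACTORS",
       "ITEM 1B. UNRESOLVED STAFF COMMENTS")

stringr::str_which(x, "RISK FACTORS")
#> [1] 2
grep("RISK FACTORS", x)
#> [1] 2
```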

Upvotes: 2
