Reputation: 20342
I see it is super-easy to grab a PDF file, save it, and fetch all the text from the file.
library(pdftools)
download.file("http://www2.sas.com/proceedings/sugi30/085-30.pdf", "sample.pdf", mode = "wb")
txt <- pdf_text("sample.pdf")
I am wondering how to loop through an array of PDF files, based on links, download each, and scrape the text from each. I want to go to the following link.
http://www2.sas.com/proceedings/sugi30/toc.html#dp
Then I want to download each file from 'Paper 085-30:' to 'Paper 095-30:'. Finally, I want to scrape the text out of each file. How can I do that?
I would think it would be something like this, but I suspect the paste function is not set up correctly.
library(pdftools)
for(i in values){'085-30',' 086-30','087-30','088-30','089-30'
paste(download.file("http://www2.sas.com/proceedings/sugi30/"i".pdf", i".pdf", mode = "wb")sep = "", collapse = NULL)
}
Upvotes: 0
Views: 176
Reputation: 70653
You can get a list of PDFs using rvest.
library(rvest)
x <- read_html("http://www2.sas.com/proceedings/sugi30/toc.html#dp")
href <- x %>% html_nodes("a") %>% html_attr("href")
# char vector of links, use regular expression to fetch only papers
links <- href[grepl("^http://www2.sas.com/proceedings/sugi30/\\d{3}.*\\.pdf$", href)]
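To keep only the papers the question asks about (085-30 through 095-30), the link vector can be filtered by file name. This is a minimal sketch, assuming the file names follow the NNN-30.pdf pattern seen in the question. keep_range is a hypothetical helper, not part of rvest, and example_links is made-up input in the same shape as links.

```r
# Hypothetical helper: keep only links whose file name matches a wanted paper id.
keep_range <- function(links, from = 85, to = 95, suffix = "-30") {
  # Build the wanted file names: "085-30.pdf", "086-30.pdf", ..., "095-30.pdf"
  wanted <- sprintf("%03d%s.pdf", from:to, suffix)
  links[basename(links) %in% wanted]
}

# Made-up input for illustration; in practice pass the `links` vector from above
example_links <- c(
  "http://www2.sas.com/proceedings/sugi30/084-30.pdf",
  "http://www2.sas.com/proceedings/sugi30/085-30.pdf",
  "http://www2.sas.com/proceedings/sugi30/095-30.pdf",
  "http://www2.sas.com/proceedings/sugi30/096-30.pdf"
)
selected <- keep_range(example_links)
```

With the real data you would call keep_range(links) and loop over the result instead of the full vector.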
I've added some error handling. Don't forget to put the R session to sleep between downloads so you don't flood the server. If a download fails, the link is stored in a variable that you can inspect after the loop has finished, and then either adapt your code or download those files manually.
# write failed links to this variable
unsuccessful <- c()
for (link in links) {
  out <- tryCatch(download.file(url = link, destfile = basename(link), mode = "wb"),
                  error = function(e) e, warning = function(w) w)
  # a condition object has several classes, so test with inherits() rather than class() %in%
  if (inherits(out, c("error", "warning"))) {
    message(sprintf("Unable to download %s", link))
    unsuccessful <- c(unsuccessful, link)
  }
  sleep <- abs(rnorm(1, mean = 10, sd = 10))
  message(sprintf("Sleeping for %f seconds", sleep))
  Sys.sleep(sleep) # don't flood the server, sleep for a while
}
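Once the files are on disk, the text can be pulled out of each one with pdf_text from pdftools, as in the question's own single-file example. This is a rough sketch, assuming pdftools is installed and the PDFs were saved to the working directory. extract_text is a hypothetical helper name.

```r
# Hypothetical helper: return a named list with one entry per file.
# Each entry is a character vector with one element per page,
# or NA if the file could not be read.
extract_text <- function(files) {
  out <- lapply(files, function(f) {
    tryCatch(pdftools::pdf_text(f), error = function(e) NA_character_)
  })
  names(out) <- basename(files)
  out
}

# All PDFs in the working directory (assumed to be the downloaded papers)
files <- list.files(pattern = "\\.pdf$")
texts <- extract_text(files)
```

An individual paper can then be looked up by file name, for example texts[["085-30.pdf"]], and entries that are NA point to files worth downloading again.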
Upvotes: 2