YellowCat

Reputation: 3

Opening a PDF from a webpage in R

I'm trying to practice text analysis with the Fed FOMC minutes.

I was able to obtain all links to the appropriate PDF files from the page below: https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm

I tried download.file("https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf", "1.pdf").

The download was successful; however, when I open the downloaded file, I get "There was an error opening this document. The file is damaged and could not be repaired." What are some ways to fix this? Is this the Fed's way of preventing web scraping?

I have 44 links (PDF files) to download and read in R. Is there a way to do this without physically downloading the files?

Upvotes: 0

Views: 247

Answers (1)

David Lucey

Reputation: 234

library(stringr)
library(rvest)
library(pdftools)

# Scrape the page with rvest and extract all href attributes
p <- 
  rvest::read_html("https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm")
pdfs <- p %>% rvest::html_elements("a") %>% rvest::html_attr("href")

# Keep only the FOMC-minutes PDF paths and rebuild full URLs
# (the hrefs are site-relative and already start with "/")
pdfs <- pdfs[stringr::str_detect(pdfs, "fomcminutes.*pdf")]
pdfs <- pdfs[!is.na(pdfs)]
paths <- paste0("https://www.federalreserve.gov", pdfs)

# Read the minutes directly from the URLs as a list of character vectors,
# one element per PDF page -- no local download needed
pdf_data <- lapply(paths, pdftools::pdf_text)
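As for the "file is damaged" error in the question: on Windows, `download.file()` defaults to a text-mode transfer, which corrupts binary files such as PDFs. Passing `mode = "wb"` downloads them intact. A sketch for saving local copies as well, reusing the `paths` vector built above (the destination file names are just an illustration):

```r
# Download local copies in binary mode so the PDFs are not corrupted
for (i in seq_along(paths)) {
  download.file(paths[i], destfile = paste0("minutes_", i, ".pdf"), mode = "wb")
}
```

Note that `pdftools::pdf_text()` reads straight from a URL, so downloading is only necessary if you want to keep the files on disk.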

Upvotes: 1
