Scraping Hyperlinks with Rvest

Question

I'd like to scrape the text and hyperlinks (of .xlsx and .pdf files) from a page using rvest. I'm not very good at this, so it's hard to tell if I'm dealing with a complicated webpage, or am just making newbie mistakes. My code thus far:

my.url <- "https://comptroller.defense.gov/Budget-Materials/Budget2019/"
my.xpath <- '//*[@id="LiveHTMLWrapper92093"]/div/div'

x <- read_html(my.url) %>% 
  html_node(xpath = my.xpath) 

{xml_node}

[1]

Alexandre georges · Accepted Answer

Here is a solution:

library(rvest)

my.url <- "https://comptroller.defense.gov/Budget-Materials/Budget2019/"
my.xpath <- '//*[@id="dnn_ctr92093_ContentPane"]'

# Download the page once and collect every <a> inside the content pane
links <- read_html(my.url) %>% 
  html_node(xpath = my.xpath) %>% html_nodes("a")

x <- html_text(links)          # link text
y <- html_attr(links, "href")  # link target

# Site-relative links start with /Portals/; prepend the host to make them absolute
y <- ifelse(grepl(pattern = "^/Portals/", y), paste0("https://comptroller.defense.gov", y), y)

df <- data.frame(x, y, stringsAsFactors = FALSE)
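
If you only want the .xlsx and .pdf links mentioned in the question, one option is to filter on the file extension and let xml2::url_absolute() resolve the relative links instead of pasting the host on by hand. This is only a minimal sketch: the inline HTML, the "content" id and the /Portals/0/... paths are made up so it runs without network access. Against the real page you would pass my.url to read_html() and select the content pane with my.xpath instead.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page (not the real markup)
page <- read_html('
  <div id="content">
    <a href="/Portals/0/example_overview.pdf">Overview</a>
    <a href="/Portals/0/example_tables.xlsx">Tables</a>
    <a href="https://example.com/about">About</a>
  </div>')

base.url <- "https://comptroller.defense.gov/Budget-Materials/Budget2019/"

# All <a> elements inside the (hypothetical) content container
links <- page %>% html_nodes("#content a")

df <- data.frame(
  text = html_text(links, trim = TRUE),
  # url_absolute() resolves relative hrefs against base.url
  # and leaves already-absolute URLs unchanged
  href = xml2::url_absolute(html_attr(links, "href"), base.url),
  stringsAsFactors = FALSE
)

# Keep only links whose target ends in .pdf or .xlsx
df <- df[grepl("\\.(pdf|xlsx)$", df$href, ignore.case = TRUE), ]
```

The extension filter assumes the hrefs carry no query string after the file name; if they do, the pattern would need loosening.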
