s_baldur

Reputation: 33538

Find absolute html path given relative href using R

I'm new to HTML, but I'm playing with a script to download all PDF files that a given webpage links to (for fun and to avoid boring manual work). I can't find where in the HTML document I should look for the data that completes relative paths. I know it is possible, since my web browser can do it.

Example: I am trying to scrape the lecture notes linked to on this page from ocw.mit.edu using the R package rvest. Whether I look at the raw HTML or access the href attribute of the "a" nodes, I only get relative paths:

library(rvest)
url <- paste0("https://ocw.mit.edu/courses/",
  "electrical-engineering-and-computer-science/",
  "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Read webpage and extract all links
links_all <- read_html(url)  %>% 
  html_nodes("a") %>%
  html_attr("href")

# Extract only href ending in "pdf"
links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)
links_pdf[1] 
[1] "/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"

Upvotes: 2

Views: 968

Answers (1)

kurast

Reputation: 1615

The easiest solution I have found so far is the url_absolute(x, base) function from the xml2 package. For the base argument, pass the URL of the page you retrieved the source from.

This seems less error-prone than trying to extract the base URL from the address with a regular expression.
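
As a minimal sketch, here is how that could look with the page URL and the first PDF link from the question. The variable names base, rel and abs_url are only illustrative, and the href is copied from the output shown in the question rather than scraped live:

```r
library(xml2)

# URL of the page the links were scraped from (taken from the question)
base <- paste0("https://ocw.mit.edu/courses/",
  "electrical-engineering-and-computer-science/",
  "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# A root-relative href, copied from the question's output
rel <- paste0("/courses/electrical-engineering-and-computer-science/",
  "6-006-introduction-to-algorithms-fall-2011/lecture-videos/",
  "mit6_006f11_lec01.pdf")

# Resolve the relative href against the page URL
abs_url <- url_absolute(rel, base)
print(abs_url)
```

Since url_absolute() is vectorised over x, passing the whole links_pdf vector from the question, as in url_absolute(links_pdf, url), should resolve every link in one call.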

Upvotes: 4
