Reputation: 33538
I'm new to HTML, but I'm playing with a script to download all PDF files that a given webpage links to (for fun and to avoid boring manual work), and I can't find where in the HTML document I should look for the data that completes relative paths. I know it must be possible, since my web browser can do it.
Example: I'm trying to scrape lecture notes linked to on this page from ocw.mit.edu using the R package rvest. Looking at the raw HTML, or accessing the href attribute of the a nodes, I only get relative paths:
library(rvest)

url <- paste0("https://ocw.mit.edu/courses/",
              "electrical-engineering-and-computer-science/",
              "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Read webpage and extract all links
links_all <- read_html(url) %>%
  html_nodes("a") %>%
  html_attr("href")

# Extract only hrefs ending in "pdf"
links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)

links_pdf[1]
[1] "/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"
Upvotes: 2
Views: 968
Reputation: 1615
The easiest solution that I have found as of today is the url_absolute(x, base) function from the xml2 package. For the base parameter, you use the URL of the page you retrieved the source from.
This seems less error-prone than trying to extract the base URL of the address via regexp.
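For example, a minimal sketch reusing the url and links_pdf objects from the question (the commented-out download.file call is just one way to fetch the result, not part of the resolution step):

library(xml2)

# Resolve the relative hrefs against the page they were scraped from
links_pdf_abs <- url_absolute(links_pdf, base = url)

links_pdf_abs[1]
[1] "https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"

# download.file(links_pdf_abs[1], basename(links_pdf_abs[1]), mode = "wb")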
Upvotes: 4