Reputation: 481
I am trying to scrape the hyperlinks from a webpage, but selecting the HTML nodes and their attributes returns no results. I don't know whether the data is stored in a meta tag, or how I would even identify that.
Using SelectorGadget, I think the CSS selector is "td", but I can also see "tr" in the page. In the dev tools I can see the link under the href attribute, but the following code does not return it:
library(rvest)
url = "https://www.firstrand.co.za/investors/debt-investor-centre/jse-listed-instruments/"
read_html(url) %>%
  html_nodes(css = "td") %>%
  html_nodes(css = "a") %>%
  html_attr('href')
Page elements:
Upvotes: 1
Views: 86
Reputation: 256
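If you look behind the scenes (for example, at the network requests in the browser's dev tools), you will see that the page fills its table from a JSON file after it loads, which is why read_html() finds no links in the static HTML. The JSON file can be read directly and reshaped to give the URL and all the other information shown on the page.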
library(tidyverse)
library(jsonlite)
# Read the JSON file that feeds the table on the page
l <- read_json("https://www.firstrand.co.za/DI/debtInstruments.json")
df <- l %>%
  enframe() %>%
  unnest_longer(value) %>%
  unnest_wider(value) %>%
  # Build the full link to each document from its file name
  mutate(url = paste0("https://www.firstrand.co.za/DI/", fileName))
Upvotes: 1
Reputation: 3173
Here's a partial answer.
Although we can extract the href values using RSelenium, they are relative paths and need further regex modification to produce working URLs.
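library(RSelenium)
library(rvest)  # provides read_html(), html_nodes(), html_attr() and %>%
# Start a Selenium server with a Firefox session
driver = rsDriver(
  port = 4847L,
  browser = c("firefox"))
remDr <- driver[["client"]]
url = "https://www.firstrand.co.za/investors/debt-investor-centre/jse-listed-instruments/"
remDr$navigate(url)
# Parse the rendered page source and collect the links inside the table
href = remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes('.jse-table') %>%
  html_nodes('a') %>%
  html_attr('href')
href = unique(href)
head(href)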
[1] "../../../DI/FRB23 Pricing Supplement 20170920.pdf" "../../../DI/APS - FRB22 - 08.12.2016.pdf"
[3] "../../../DI/FRB28 Pricing Supplement 02122020 Amended.pdf" "../../../DI/FRB24 Amended Pricing Supplement 13042021.pdf"
[5] "../../../DI/FRB25 Amended Pricing Supplement 13042021 Tranche 2.pdf" "../../../DI/FRB25 Amended Pricing Supplement 13042021 Tranche 3.pdf"
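To finish the step described above, here is a minimal sketch of turning those relative hrefs into working URLs with base R only. It assumes that the three "../" steps lead from the page URL up to the site root, so that the files sit under https://www.firstrand.co.za/DI/ (the same prefix the JSON answer uses). The names site_root, path and full_url are just illustrative.

```r
# Two of the relative hrefs from the output above
href <- c(
  "../../../DI/FRB23 Pricing Supplement 20170920.pdf",
  "../../../DI/APS - FRB22 - 08.12.2016.pdf"
)

# Assumption: three "../" steps up from the page URL reach the site root
site_root <- "https://www.firstrand.co.za/"

# Drop every leading "../" segment, leaving the path relative to the root
path <- sub("^(\\.\\./)+", "", href)

# Prepend the root and percent-encode spaces; "/" and ":" are left untouched
full_url <- utils::URLencode(paste0(site_root, path))

full_url
```

The spaces in the file names are the reason for URLencode(): without it, some tools will reject or truncate the link. If the page ever moves to a different depth, the site_root assumption no longer holds and the links should be resolved against the actual page URL instead.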
Upvotes: 0