Aveshen Pillay

Reputation: 481

Finding the correct attributes to scrape within a page using rvest

I am trying to scrape the underlying hyperlinks on a webpage, but selecting the HTML nodes and their attributes returns no results. I don't know whether the data is stored in a meta tag, or how to check where it comes from.

Using SelectorGadget, I think the CSS selector is "td", but I can also see "tr" in the page. In the browser dev tools I can see the link under the href attribute, but the following code does not return it:

library(rvest)

url = "https://www.firstrand.co.za/investors/debt-investor-centre/jse-listed-instruments/"

read_html(url) %>%
  html_nodes(css = "td") %>%
  html_nodes(css = "a") %>%
  html_attr('href')

Page elements: [screenshot not shown]

Upvotes: 1

Views: 86

Answers (2)

mkpt_uk

Reputation: 256

If you look behind the scenes, you will see that the page is populated from a JSON file. That file can be read directly and reshaped to give the URLs along with all the other information shown on the page.

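library(tidyverse)
library(jsonlite)

# Read the JSON file that feeds the table on the page
l <- read_json("https://www.firstrand.co.za/DI/debtInstruments.json")

# Flatten the nested list into a data frame and build the full link to each file
df <- l %>% 
  enframe() %>% 
  unnest_longer(value) %>% 
  unnest_wider(value) %>% 
  mutate(url = paste0("https://www.firstrand.co.za/DI/", fileName))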
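
A quick way to check whether links are missing from the HTML that read_html() receives is to count the matching nodes. The sketch below uses a made-up stand-in for the page (the markup and the script name are hypothetical; only the jse-table class appears elsewhere in this thread): the table shell exists, but the rows would be filled in later by JavaScript, so the selector finds nothing.

```r
library(rvest)

# Hypothetical stand-in for the HTML the server returns: the table shell is
# there, but its rows are added later by JavaScript, which rvest does not run.
static_html <- minimal_html('
  <table class="jse-table"><tbody></tbody></table>
  <script src="hypothetical-loader.js"></script>
')

links <- static_html %>%
  html_elements("td a") %>%
  html_attr("href")

# A length of zero means the links are not in the static HTML.
length(links)
```

If the same check on the real page also gives zero while the links are visible in the browser, the data is probably loaded separately, as with the JSON file above.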

Upvotes: 1

Nad Pat

Reputation: 3173

Here's a partial answer.

We can extract the href values using RSelenium, but they are relative paths and need further regex modification to become working URLs.

library(RSelenium)
library(rvest)  # read_html(), html_nodes(), html_attr() and %>%

# Start a Selenium server and a Firefox session
driver = rsDriver(
  port = 4847L,
  browser = "firefox")

remDr <- driver[["client"]]

url = "https://www.firstrand.co.za/investors/debt-investor-centre/jse-listed-instruments/"
remDr$navigate(url)

# Parse the rendered page source and pull the links out of the table
href = remDr$getPageSource()[[1]] %>% 
  read_html() %>% 
  html_nodes('.jse-table') %>% 
  html_nodes('a') %>% 
  html_attr('href')

href = unique(href)
head(href)
[1] "../../../DI/FRB23 Pricing Supplement 20170920.pdf"                   "../../../DI/APS - FRB22 - 08.12.2016.pdf"                           
[3] "../../../DI/FRB28 Pricing Supplement 02122020 Amended.pdf"           "../../../DI/FRB24 Amended Pricing Supplement 13042021.pdf"          
[5] "../../../DI/FRB25 Amended Pricing Supplement 13042021 Tranche 2.pdf" "../../../DI/FRB25 Amended Pricing Supplement 13042021 Tranche 3.pdf"
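
The relative paths in the output above can be turned into full links with a small regex step. This is a minimal sketch using two of the values shown; the site root is an assumption inferred from the "../../../" prefix relative to the page URL in the question, and the variable names are only illustrative.

```r
# Assumption: the site root, inferred from the "../../../" prefix relative to
# the page URL in the question.
site_root <- "https://www.firstrand.co.za/"

# Two of the relative hrefs from the output above.
href <- c(
  "../../../DI/FRB23 Pricing Supplement 20170920.pdf",
  "../../../DI/APS - FRB22 - 08.12.2016.pdf"
)

# Strip the leading run of "../" segments and prepend the site root.
abs_url <- sub("^(\\.\\./)+", site_root, href)

# Percent-encode the spaces so the links can be passed to download functions.
abs_url <- utils::URLencode(abs_url)

abs_url
```

For pages where the links sit at different depths, xml2::url_absolute() is a more general way to resolve relative paths against the page URL.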

Upvotes: 0
