Reputation: 1

Obtaining hrefs from main page using rvest (SelectorGadget and inspecting source code)

I am using rvest to scrape a website (here). I am trying to obtain the URL for all 582 individuals listed. For instance, the URL for one of the individuals is here.

Once I am on an individual's page, I can successfully scrape the information I am looking for. Here is an example:

library(rvest)

link = "https://www.supercluster.com/astronauts/jessica-u.-meir?sort=&ascending=false&life%20form=human&"

page = read_html(link)

# Time in space and spacewalk time
page %>% html_nodes("span.pr015")

# Gender
page %>% html_nodes("a.under")

# Cross Karman Line
page %>% html_nodes("div.pt1.pb0.h5.caps.cw")

Any advice on how to obtain the list of 582 URLs from the main page using rvest? I tried SelectorGadget and inspected the source code, but to no avail. Thank you for your help!

Upvotes: 0

Answers (1)

Adam Sampson

Reputation: 2021

Because the list is loaded on the fly with JavaScript, it is not in the HTML that rvest reads, so you have to check whether you can access the place the data actually comes from. The network inspector in Chrome or Firefox shows every data source that is requested when the site loads.

From there you can see that the list of all astronauts is from the following data source: https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb.json

Similarly, you can see that additional details for an individual can be obtained from https://www.supercluster.com/page-data/astronauts/vladimir-dzhanibekov/page-data.json. A GET request to that URL lets you skip the "scraping" part of your current script, which also makes each request faster and uses a lot less data. Working out those links for every astronaut is a separate question, though.

library(dplyr)
library(rvest)
library(httr)
library(jsonlite)

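# Download the JSON file that backs the astronaut list
list_astro <- httr::GET("https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb.json")

# The response body is raw bytes; convert it to a character string
list_astro <- rawToChar(list_astro$content)

# Parse the JSON text into R lists and data frames
list_astro_parsed <- jsonlite::fromJSON(list_astro)

# One row per entry, with the page URL built from the slug
create_links <- tibble(
  astronauts = list_astro_parsed$astronauts$name,
  slug = list_astro_parsed$astronauts$slug$current
) %>%
  mutate(
    page_link = paste0("https://www.supercluster.com/astronauts/", slug)
  )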
create_links
# A tibble: 910 x 3
#   astronauts            slug                  page_link
#   <chr>                 <chr>                 <chr>                                                        
# 1 Yuri Gagarin          yuri-gagarin          https://www.supercluster.com/astronauts/yuri-gagarin         
# 2 Walter M. Schirra Jr. walter-m.-schirra-jr. https://www.supercluster.com/astronauts/walter-m.-schirra-jr.
# 3 Georgi Ivanov         georgi-ivanov         https://www.supercluster.com/astronauts/georgi-ivanov        
# 4 Leonid Popov          leonid-popov          https://www.supercluster.com/astronauts/leonid-popov         
# 5 Bertalan Farkas       bertalan-farkas       https://www.supercluster.com/astronauts/bertalan-farkas
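
As a rough sketch of how the per-astronaut detail links might be built from the slugs: this assumes every astronaut follows the same URL pattern as the vladimir-dzhanibekov page-data.json example, which has not been checked for all slugs. The helper name build_page_data_url is hypothetical and only for illustration.

```r
# Hypothetical helper: builds the page-data.json URL for one or more slugs.
# Assumes each astronaut page follows the same pattern as the
# vladimir-dzhanibekov example; this is not confirmed for every slug.
build_page_data_url <- function(slug) {
  paste0(
    "https://www.supercluster.com/page-data/astronauts/",
    slug,
    "/page-data.json"
  )
}

# paste0() is vectorised, so a whole column of slugs works as well
slugs <- c("yuri-gagarin", "vladimir-dzhanibekov")
detail_links <- build_page_data_url(slugs)
detail_links

# Fetching one of them needs network access, and the JSON layout is not
# described here, so inspect it before pulling out fields:
# details <- jsonlite::fromJSON(detail_links[1])
# str(details, max.level = 2)
```

With the tibble above, the same helper could be applied inside mutate(), for example mutate(data_link = build_page_data_url(slug)).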

Upvotes: 2
