Reputation: 1
I am using rvest to scrape a website (here). I am trying to obtain the URL for each of the 582 individuals listed. For instance, the URL for one of the individuals is here.
Once I am on an individual's page, I can successfully scrape the information I am looking for. Here is an example:
library(rvest)
link = "https://www.supercluster.com/astronauts/jessica-u.-meir?sort=&ascending=false&life%20form=human&"
page = read_html(link)
# Time in space and spacewalk time
page %>% html_nodes("span.pr015")
# Gender
page %>% html_nodes("a.under")
# Cross Karman Line
page %>% html_nodes("div.pt1.pb0.h5.caps.cw")
Any advice on how to obtain the list of 582 URLs from the main page using rvest? I tried using SelectorGadget and inspecting the source code, but to no avail. Thank you for your help!
Upvotes: 0
Views: 202
Reputation: 2021
Because the list is loaded on the fly with JavaScript, it is not in the static HTML that rvest reads, so you have to find out where the data actually comes from and whether you can access it. Using the network inspector in Chrome or Firefox, you can see all of the data sources that are requested when the site loads.
From there you can see that the list of all astronauts is from the following data source: https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb.json
Similarly, you can see that additional details for an individual can be obtained from https://www.supercluster.com/page-data/astronauts/vladimir-dzhanibekov/page-data.json with a plain GET request, which lets you skip the "scraping" part of your current script. This also makes your requests faster and uses a lot less data. Working out those links for every individual is a separate question, though.
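As a quick illustration of why SelectorGadget and rvest alone come up empty, here is a minimal sketch using a made-up toy page (not the real site). The link is only created when the script runs, and rvest never runs scripts, so it finds nothing.

```r
library(rvest)

# Toy page (made up for illustration): the <a> element only exists
# after a browser executes the script.
toy_page <- minimal_html('
  <div id="list"></div>
  <script>
    var a = document.createElement("a");
    a.href = "/astronauts/yuri-gagarin";
    document.getElementById("list").appendChild(a);
  </script>
')

# rvest parses the HTML as delivered and does not execute the script,
# so no <a> elements are found.
n_links <- length(html_elements(toy_page, "a"))
n_links
```

The same thing happens on the real page: the list is filled in by JavaScript, which is why going to the underlying JSON is the easier route.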
library(dplyr)
library(rvest)
library(httr)
library(jsonlite)

# Download the JSON file that the page itself loads
list_astro <- httr::GET("https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb.json")
list_astro <- rawToChar(list_astro$content)
list_astro_parsed <- jsonlite::fromJSON(list_astro)

# Build one page link per astronaut from its slug
create_links <- tibble(
  astronauts = list_astro_parsed$astronauts$name,
  slug = list_astro_parsed$astronauts$slug$current
) %>%
  mutate(
    page_link = paste0("https://www.supercluster.com/astronauts/", slug)
  )

create_links
# A tibble: 910 x 3
#   astronauts            slug                  page_link
#   <chr>                 <chr>                 <chr>
# 1 Yuri Gagarin          yuri-gagarin          https://www.supercluster.com/astronauts/yuri-gagarin
# 2 Walter M. Schirra Jr. walter-m.-schirra-jr. https://www.supercluster.com/astronauts/walter-m.-schirra-jr.
# 3 Georgi Ivanov         georgi-ivanov         https://www.supercluster.com/astronauts/georgi-ivanov
# 4 Leonid Popov          leonid-popov          https://www.supercluster.com/astronauts/leonid-popov
# 5 Bertalan Farkas       bertalan-farkas       https://www.supercluster.com/astronauts/bertalan-farkas
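The page-data.json links mentioned earlier could probably be built from the same slugs. This is only a sketch: `page_data_url` is a hypothetical helper, and it assumes that every individual follows the URL pattern of the single vladimir-dzhanibekov example, which I have not checked for all slugs.

```r
# Hypothetical helper: builds the page-data.json URL for one or more slugs,
# assuming all pages follow the pattern of the one example URL above.
page_data_url <- function(slug) {
  paste0(
    "https://www.supercluster.com/page-data/astronauts/",
    slug,
    "/page-data.json"
  )
}

slugs <- c("yuri-gagarin", "vladimir-dzhanibekov")
detail_links <- page_data_url(slugs)
detail_links

# Fetching one of them would then look like this (not run here; the structure
# of the returned JSON is not shown in this answer, so inspect it first):
# detail <- jsonlite::fromJSON(detail_links[2])
# str(detail, max.level = 2)
```

With the tibble from above, `mutate(data_link = page_data_url(slug))` would add these links as an extra column.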
Upvotes: 2