bandcar

Reputation: 743

How to scrape links off a website in R? Unable to scrape links

I'm trying to scrape the links from this website:

library(rvest)
library(tidyverse)
url=read_html('https://web.archive.org/web/*/https://www.bjjcompsystem.com/tournaments/1869/categories*')

get_links <- url %>% 
  html_nodes('#resultsUrl a') %>% 
  html_attr('href') %>%
  paste0('https://web.archive.org/web/20220000000000*/', .)
get_links

But all I get is character(0). I also tried selecting by the li class, as was suggested to me before, but that returned nothing useful either.

Can someone explain what I'm doing wrong and how to fix it?

Upvotes: 0

Views: 76

Answers (1)

HoelR

Reputation: 6583

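Get the links from their source. The results list on that page is not in the HTML that read_html() downloads, which is why the selector returns character(0). Request the Wayback Machine's JSON timemap endpoint directly instead: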

library(tidyverse)
library(httr2)
library(janitor)

# Request the capture index as JSON from the Wayback Machine timemap endpoint
"https://web.archive.org/web/timemap/json?url=https://www.bjjcompsystem.com/tournaments/1869/categories&matchType=prefix&collapse=urlkey&output=json&fl=original,mimetype,timestamp,endtimestamp,groupcount,uniqcount&filter=!statuscode:[45]..&limit=10000&_=1663136483842" %>% 
  request() %>% 
  req_perform() %>% 
  resp_body_json(simplifyVector = TRUE) %>% 
  as_tibble() %>% 
  # The first row of the response holds the column names
  row_to_names(1)

# A tibble: 784 × 6
   original                                           mimet…¹ times…² endti…³ group…⁴ uniqc…⁵
   <chr>                                              <chr>   <chr>   <chr>   <chr>   <chr>  
 1 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 3       3      
 2 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 6       6      
 3 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 2       2      
 4 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
 5 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 2       2      
 6 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
 7 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
 8 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 2       2      
 9 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
10 https://www.bjjcompsystem.com/tournaments/1869/ca… text/h… 202209… 202209… 1       1      
# … with 774 more rows, and abbreviated variable names ¹​mimetype, ²​timestamp, ³​endtimestamp,
#   ⁴​groupcount, ⁵​uniqcount
# ℹ Use `print(n = ...)` to see more rows
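
If the goal is a link to each archived page rather than the live URL, the timestamp and original columns can be combined. A minimal sketch, assuming the usual Wayback Machine snapshot pattern https://web.archive.org/web/<timestamp>/<original>. The helper build_snapshot_url() and the two example rows are hypothetical placeholders, not part of the result above:

```r
# Hypothetical helper: builds a Wayback Machine snapshot URL from a capture
# timestamp and the original page URL (assumed pattern: /web/<timestamp>/<original>).
build_snapshot_url <- function(timestamp, original) {
  paste0("https://web.archive.org/web/", timestamp, "/", original)
}

# Placeholder rows standing in for the tibble returned by the timemap request;
# the values are made up for illustration only.
captures <- data.frame(
  original  = c("https://example.com/page-a", "https://example.com/page-b"),
  timestamp = c("20220101000000", "20220102000000"),
  stringsAsFactors = FALSE
)

# paste0() is vectorised, so whole columns can be passed at once.
captures$snapshot_url <- build_snapshot_url(captures$timestamp, captures$original)
captures$snapshot_url
```

With the real result, the same call would take the timestamp and original columns of the tibble, for example inside mutate().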

Upvotes: 2
