gebo-aa

Reputation: 37

Retrieve link from html table with rvest

I would like to scrape a website that stores German cycling results, but I'm struggling to get the URLs that point to each race result. The website with the result table is the one read in the code below.

This is what I have so far. The HTML table also seems oddly formatted to me, but that could be due to my lack of HTML knowledge:

library(tidyverse)
library(magrittr)
library(rvest)

# Read the results page
result_url <- "https://www.rad-net.de/rad-net-ergebnisse.htm?name=Ausschreibung&view=ascr_erg&rnswp_disziplin=1"
results <- read_html(result_url)

# Extract date and race name from the 8th table on the page
results %>%
  html_table(header = TRUE, fill = TRUE) %>%
  extract2(8) %>%
  tibble()
#> # A tibble: 40 x 2
#>    Datum         Veranstaltungstitel                                            
#>    <chr>         <chr>                                                          
#>  1 So, 19.07.20… "5. Rosenheimer Jugend - Kriterium"                            
#>  2 So, 12.07.20… "Swiss O Par Preis"                                            
#>  3 So, 12.07.20… "Deutsche Meisterschaft Einzelzeitfahren U19m/w"               
#>  4 So, 12.07.20… "Jugendrenntag der RV Offenbach"                               
#>  5 Sa, 04.07.20… "CoronaChronoNRW"                                              
#>  6 Sa, 20.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#>  7 Sa, 13.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#>  8 Sa, 06.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#>  9 So, 31.05.20… "Westsachsenklassiker - 72. Sachsenringradrennen"              
#> 10 So, 08.03.20… "8. Herforder Frühjahrspreis"                                  
#> # … with 30 more rows

Created on 2020-07-25 by the reprex package (v0.3.0)

Upvotes: 0

Views: 850

Answers (1)

Allan Cameron

Reputation: 173803

I think you need more information than html_table normally provides: it returns only the cell text and drops the href attributes (and the page actually contains several nested HTML tables anyway). Selecting the link nodes directly gives you both the text and the URL:

library(tidyverse)
library(magrittr)
library(rvest)

results <- paste0("https://www.rad-net.de/rad-net-ergebnisse.htm",
                  "?name=Ausschreibung&view=ascr_erg&rnswp_disziplin=1") %>%
             read_html()

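# All links that sit inside any table on the page
link_nodes <- results %>% html_nodes(xpath = "//table//a")
link_text  <- link_nodes %>% html_text()

# Keep only the links between the "hier" link and the "Nächste" link
index <- (which(link_text == "hier") + 1):(which(link_text == "N\u00e4chste") - 1)
link_nodes <- link_nodes[index]

# The XPath starts with "//", so it is evaluated against the whole document,
# not relative to each link; dates[-1] below drops the first match so that
# the remaining dates line up with the links
dates <- link_nodes %>%
  html_nodes(xpath = "//table//a/parent::td/preceding-sibling::td/font") %>%
  html_text()

df <- tibble(Datum = dates[-1],
             Veranstaltungstitel = link_nodes %>% html_text(),
             link = link_nodes %>% html_attr("href"))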

df
#> # A tibble: 40 x 3
#>    Datum     Veranstaltungstitel                   link                         
#>    <chr>     <chr>                                 <chr>                        
#>  1 So, 19.0~ "5. Rosenheimer Jugend - Kriterium"   /rad-net-portal/rad-net-erge~
#>  2 So, 12.0~ "Swiss O Par Preis"                   /rad-net-portal/rad-net-erge~
#>  3 So, 12.0~ "Deutsche Meisterschaft Einzelzeitfa~ /rad-net-portal/rad-net-erge~
#>  4 So, 12.0~ "Jugendrenntag der RV Offenbach"      /rad-net-portal/rad-net-erge~
#>  5 Sa, 04.0~ "CoronaChronoNRW"                     /rad-net-portal/rad-net-erge~
#>  6 Sa, 20.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#>  7 Sa, 13.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#>  8 Sa, 06.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#>  9 So, 31.0~ "Westsachsenklassiker - 72. Sachsenr~ /rad-net-portal/rad-net-erge~
#> 10 So, 08.0~ "8. Herforder Frühjahrspreis"         /rad-net-portal/rad-net-erge~
#> # ... with 30 more rows

Created on 2020-07-25 by the reprex package (v0.3.0)
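
If you want something less dependent on link order, one possible variation is to select the table rows that contain a link and then query each row with a relative XPath (leading "."), so the date and the link always come from the same row. The sketch below runs on a small made-up HTML snippet rather than the real page; the markup, the hrefs and the `base_url` value are assumptions for illustration, so the XPath would probably need adjusting for the live site. It also turns the relative hrefs into absolute URLs with `xml2::url_absolute()`.

```r
library(rvest)   # also re-exports %>%
library(tibble)

# Hypothetical miniature of the results table; the real markup may differ
page <- read_html('
<table>
  <tr><td><font>Datum</font></td><td><font>Veranstaltungstitel</font></td></tr>
  <tr><td><font>So, 19.07.2020</font></td><td><a href="/ergebnisse/1.htm">Race A</a></td></tr>
  <tr><td><font>So, 12.07.2020</font></td><td><a href="/ergebnisse/2.htm">Race B</a></td></tr>
</table>')

# Assumed base URL, used only to resolve the relative hrefs
base_url <- "https://www.rad-net.de/rad-net-ergebnisse.htm"

# Only rows that have a link in one of their cells (this skips the header row)
rows <- page %>% html_nodes(xpath = "//table//tr[td/a]")

# html_node() returns exactly one match per row, so all columns stay aligned
df <- tibble(
  Datum               = rows %>% html_node(xpath = "./td[1]") %>% html_text(trim = TRUE),
  Veranstaltungstitel = rows %>% html_node(xpath = "./td/a") %>% html_text(trim = TRUE),
  link                = rows %>% html_node(xpath = "./td/a") %>% html_attr("href")
)

# Resolve the relative links against the base URL
df$url <- xml2::url_absolute(df$link, base_url)

df
```

Because each column is extracted from the same set of row nodes, there is no need to drop elements by position to make the lengths match.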

Upvotes: 1
