Reputation: 37
I would like to scrape a website that stores German cycling results, but I'm struggling to get the URLs that point to the individual race results. The page with the result table is the one whose URL appears in the code below.
This is what I have so far. To me, the HTML table also seems quite oddly formatted, but that could be due to my lack of HTML knowledge:
library(tidyverse)
library(magrittr)
library(rvest)
# read html
result_url <- "https://www.rad-net.de/rad-net-ergebnisse.htm?name=Ausschreibung&view=ascr_erg&rnswp_disziplin=1"
results <- read_html(result_url)
# extract date, race name
results %>%
  html_table(header = TRUE, fill = TRUE) %>%
  extract2(8) %>%
  tibble()
#> # A tibble: 40 x 2
#> Datum Veranstaltungstitel
#> <chr> <chr>
#> 1 So, 19.07.20… "5. Rosenheimer Jugend - Kriterium"
#> 2 So, 12.07.20… "Swiss O Par Preis"
#> 3 So, 12.07.20… "Deutsche Meisterschaft Einzelzeitfahren U19m/w"
#> 4 So, 12.07.20… "Jugendrenntag der RV Offenbach"
#> 5 Sa, 04.07.20… "CoronaChronoNRW"
#> 6 Sa, 20.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#> 7 Sa, 13.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#> 8 Sa, 06.06.20… "GCA Klassiker PROS / KIDS / ALL powered by «Müller \u0096 Die…
#> 9 So, 31.05.20… "Westsachsenklassiker - 72. Sachsenringradrennen"
#> 10 So, 08.03.20… "8. Herforder Frühjahrspreis"
#> # … with 30 more rows
Created on 2020-07-25 by the reprex package (v0.3.0)
Upvotes: 0
Views: 850
Reputation: 173803
I think you're looking for a bit more information than the html_table function normally provides: it returns only the cell text, so the href attributes of the links are lost (there are also several nested HTML tables on the page anyway). I think this is what you are looking for:
library(tidyverse)
library(magrittr)
library(rvest)
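# read the page; the URL is split in two only to keep the lines short
results <- paste0("https://www.rad-net.de/rad-net-ergebnisse.htm",
                  "?name=Ausschreibung&view=ascr_erg&rnswp_disziplin=1") %>%
  read_html()
# collect every link inside any table, then keep only the links that sit
# between the "hier" link and the "Nächste" ("next") link
link_nodes <- results %>% html_nodes(xpath = "//table//a")
link_text <- link_nodes %>% html_text()
index <- (which(link_text == "hier") + 1):(which(link_text == "N\u00e4chste") - 1)
link_nodes <- link_nodes[index]
# the dates are in the <font> tag of the cell that precedes each link's cell
dates <- link_nodes %>%
  html_nodes(xpath = "//table//a/parent::td/preceding-sibling::td/font") %>%
  html_text()
# dates[-1] drops the first match so that the dates line up with the links
df <- tibble(Datum = dates[-1],
             Veranstaltungstitel = link_nodes %>% html_text(),
             link = link_nodes %>% html_attr("href"))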
df
#> # A tibble: 40 x 3
#> Datum Veranstaltungstitel link
#> <chr> <chr> <chr>
#> 1 So, 19.0~ "5. Rosenheimer Jugend - Kriterium" /rad-net-portal/rad-net-erge~
#> 2 So, 12.0~ "Swiss O Par Preis" /rad-net-portal/rad-net-erge~
#> 3 So, 12.0~ "Deutsche Meisterschaft Einzelzeitfa~ /rad-net-portal/rad-net-erge~
#> 4 So, 12.0~ "Jugendrenntag der RV Offenbach" /rad-net-portal/rad-net-erge~
#> 5 Sa, 04.0~ "CoronaChronoNRW" /rad-net-portal/rad-net-erge~
#> 6 Sa, 20.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#> 7 Sa, 13.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#> 8 Sa, 06.0~ "GCA Klassiker PROS / KIDS / ALL pow~ /rad-net-portal/rad-net-erge~
#> 9 So, 31.0~ "Westsachsenklassiker - 72. Sachsenr~ /rad-net-portal/rad-net-erge~
#> 10 So, 08.0~ "8. Herforder Frühjahrspreis" /rad-net-portal/rad-net-erge~
#> # ... with 30 more rows
Created on 2020-07-25 by the reprex package (v0.3.0)
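Note that the links in the output are relative to the site root. If you need full URLs, or want something less dependent on the position of the "hier" and "Nächste" links, here is a minimal sketch of a row-wise variant. It runs on a small made-up HTML snippet, not on the real page: the table layout, the race names and the `race-1.htm` / `race-2.htm` paths are assumptions for illustration only, so the XPath expressions would probably need adjusting for the actual markup. The base URL is the one from the question, and `url_absolute()` comes from xml2.

```r
library(rvest)
library(xml2)
library(tibble)

# Hypothetical stand-in for the page: a date cell followed by a cell
# that holds the linked race title (NOT the real rad-net markup).
html <- '
<table>
  <tr><td><font>Datum</font></td><td><font>Veranstaltungstitel</font></td></tr>
  <tr><td><font>So, 19.07.2020</font></td><td><a href="/rad-net-portal/race-1.htm">Race one</a></td></tr>
  <tr><td><font>So, 12.07.2020</font></td><td><a href="/rad-net-portal/race-2.htm">Race two</a></td></tr>
</table>'

page <- read_html(html)

# Keep only the rows that contain a link, which skips the header row.
rows <- page %>% html_nodes(xpath = "//table//tr[td/a]")

# html_node() (singular) returns one match per row, and the leading "."
# makes each XPath relative to its row, so the columns stay aligned.
links <- rows %>% html_node(xpath = "./td/a")

base_url <- "https://www.rad-net.de/rad-net-ergebnisse.htm"

demo <- tibble(
  Datum               = rows %>% html_node(xpath = "./td[1]") %>% html_text(),
  Veranstaltungstitel = links %>% html_text(),
  # resolve the relative href against the URL of the page it came from
  link                = links %>% html_attr("href") %>% url_absolute(base_url)
)

demo
```

Because the date and the link are both read from the same row, there is no need to drop a leading element or to locate marker links by their text.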
Upvotes: 1