udden2903

Reputation: 783

Scraping information from multiple webpages using rvest

I am trying to scrape the results of the 2012-2016 Stockholm Marathon races. I can do so with the code outlined below, but every time I scrape one year's results I have to manually change the URL to capture the next year.

This bothers me, as the only thing that needs to change is the year (the 2012 part) in http://results.marathon.se/2012/?content=list&event=STHM&num_results=250&page=1&pid=list&search[sex]=M&lang=SE.
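
To illustrate the pattern I mean, something like this minimal sketch would generate the per-year first-page URLs with sprintf() (first_page_urls is just an illustrative name):

# sketch only: the year is the single piece of the URL that varies
years <- 2012:2016
first_page_urls <- sprintf('http://results.marathon.se/%s/?content=list&event=STHM&num_results=250&page=1&pid=list&search[sex]=M&lang=EN', years)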

How can I modify the code below so that it scrapes the results from each year, outputting the results into a single dataframe that also includes a column to indicate the year to which the observation belongs?

library(dplyr)
library(rvest)
library(tidyverse)

# Find the total number of pages to scrape
tot_pages <- read_html('http://results.marathon.se/2012/?content=list&event=STHM&num_results=250&page=1&pid=list&search[sex]=M&lang=EN') %>%
  html_nodes('a:nth-child(6)') %>% html_text() %>% as.numeric()

#Store the URLs in a vector
URLs <- sprintf('http://results.marathon.se/2012/?content=list&event=STHM&num_results=250&page=%s&pid=list&search[sex]=M&lang=EN', 1:tot_pages)

#Create a progress bar
pb <- progress_estimated(tot_pages, min = 0)

# Create a function to scrape the name and finishing time from each page
getdata <- function(URL) {
  pb$tick()$print()
  pg <- read_html(URL)
  html_nodes(pg, 'tbody td:nth-child(3)') %>% html_text() %>%
    as_tibble() %>% set_names(c('Name')) %>%
    mutate(finish_time = html_nodes(pg, 'tbody .right') %>% html_text())
}

#Map everything into a dataframe
map_df(URLs, getdata) -> results

Upvotes: 0

Views: 596

Answers (1)

ulfelder

Reputation: 5335

You can use lapply to do this:

library(dplyr)
library(rvest)
library(tidyverse)

# make a vector of the years you want
years <- seq(2012,2016)

# now use lapply to iterate your code over those years
Results.list <- lapply(years, function(x) {

  # make a target url with the relevant year
  link <- sprintf('http://results.marathon.se/%s/?content=list&event=STHM&num_results=250&page=1&pid=list&search[sex]=M&lang=EN', x)

  # Find the total number of pages to scrape
  tot_pages <- read_html(link) %>%
    html_nodes('a:nth-child(6)') %>% html_text() %>% as.numeric()

  # Store the URLs in a vector
  URLs <- sprintf('http://results.marathon.se/%s/?content=list&event=STHM&num_results=250&page=%s&pid=list&search[sex]=M&lang=EN', x, 1:tot_pages)

  #Create a progress bar
  pb <- progress_estimated(tot_pages, min = 0)

  # Create a function to scrape the name and finishing time from each page
  getdata <- function(URL) {
    pb$tick()$print()
    pg <- read_html(URL)
    html_nodes(pg, 'tbody td:nth-child(3)') %>% html_text() %>%
      as_tibble() %>% set_names(c('Name')) %>%
      mutate(finish_time = html_nodes(pg, 'tbody .right') %>% html_text())
  }

  #Map everything into a dataframe
  map_df(URLs, getdata) -> results

  # add an id column indicating which year
  results$year <- x

  return(results)

})

# now collapse the resulting list into one tidy df
Results <- bind_rows(Results.list)
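
Since purrr loads with tidyverse, the same result can also be had without the explicit lapply()/bind_rows() pair by mapping over the years with map_dfr(), whose .id argument adds the year column for you. A minimal sketch under the same assumptions about the page structure (scrape_year() is just an illustrative name, and the progress bar is omitted for brevity):

# sketch: scrape all pages for one year and return a tibble of names and times
scrape_year <- function(yr) {
  first_page <- sprintf('http://results.marathon.se/%s/?content=list&event=STHM&num_results=250&page=1&pid=list&search[sex]=M&lang=EN', yr)
  tot_pages <- read_html(first_page) %>% html_nodes('a:nth-child(6)') %>% html_text() %>% as.numeric()
  URLs <- sprintf('http://results.marathon.se/%s/?content=list&event=STHM&num_results=250&page=%s&pid=list&search[sex]=M&lang=EN', yr, 1:tot_pages)
  map_df(URLs, function(URL) {
    pg <- read_html(URL)
    tibble(Name        = html_nodes(pg, 'tbody td:nth-child(3)') %>% html_text(),
           finish_time = html_nodes(pg, 'tbody .right') %>% html_text())
  })
}

# map_dfr() row-binds the per-year tibbles; .id = "year" adds a year column
# (as character, taken from the names of the input vector)
years <- set_names(2012:2016)
Results <- map_dfr(years, scrape_year, .id = "year")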

Upvotes: 1
