jvalenti

Reputation: 640

web scraping across multiple pages with rvest

I have a website that I would like to scrape, but the data spans 8 pages. I have used the following to get the first page of data:

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&college_id=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=year_id'
webpage <- read_html(site)

draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
head(draft)
draft <- draft[-1,]
names(draft) <- c("rank", "year", "league", "round", "pick", "team", "player", "age", "position", "birth", "college",
                  "yearin", "lastyear", "car.gp", "car.mp", "car.ppg", "car.rebpg", "car.apg", "car.stlpg", "car.blkpg",
                  "car.fgp", "car.2pfgp", "car.3pfgp", "car.ftp", "car.ws", "car.ws48")

draft <- draft[draft$player != "" & draft$player != "Player", ]

It appears that the URLs move in sequence: the first page has offset=0, the second offset=100, the third offset=200, and so on.
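For example, assuming the offset is simply appended as an &offset= query parameter (which I have not verified), the eight page URLs would look something like this:

offsets <- seq(0, 700, by = 100)   # 8 pages of 100 rows each
page_urls <- paste0(site, '&offset=', offsets)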

My problem is that I cannot find a straightforward way to scrape all 8 pages without manually pasting each URL into the "site" vector above. I would like to do this generally, so if there is a general rvest solution that would be great. Otherwise, any help or suggestions would be greatly appreciated. Thanks so much.

Upvotes: 3

Views: 1333

Answers (1)

Brandon LeBeau

Reputation: 331

The function follow_link is what you are looking for.

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&college_id=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=year_id'

s <- html_session(site)

# Follow the next-page link from the first page
s <- s %>% follow_link(css = '#pi p a')
url2 <- s$url

# From the second page onward, the next-page link matches a different selector
s <- s %>% follow_link(css = '#pi a+ a')
url3 <- s$url

It appears that the link pattern repeats after the second page, so the remaining pages can be reached by calling follow_link(css = '#pi a+ a') repeatedly.
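Putting this together, here is a sketch of one way to walk all 8 pages and stack the tables; the page count and CSS selectors are assumptions taken from the above, so adjust them if the site changes:

s <- html_session(site)
tables <- vector("list", 8)

# Table on the first page
tables[[1]] <- html_table(html_nodes(s, 'table'))[[1]]

# Follow the next-page link seven more times, grabbing the table each time
for (i in 2:8) {
  s <- s %>% follow_link(css = if (i == 2) '#pi p a' else '#pi a+ a')
  tables[[i]] <- html_table(html_nodes(s, 'table'))[[1]]
}

draft <- do.call(rbind, tables)
# The same header clean-up as in the question would then apply to draft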

Upvotes: 3
