Reputation: 992
I am new to web scraping and am trying to scrape tables on multiple web pages. Here is the site: http://www.baseball-reference.com/teams/MIL/2016.shtml
I am able to scrape a table on one page rather easily using rvest. There are multiple tables, but I only want to scrape the first one. Here is my code:
library(rvest)
url4 <- "http://www.baseball-reference.com/teams/MIL/2016.shtml"
Brewers2016 <- url4 %>% read_html() %>%
  html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
  html_table()
Brewers2016 <- as.data.frame(Brewers2016)
The problem is that I want to scrape the first table on each team page for every season dating back to 1970. There is a link to the previous season at the top left corner, just above the table. Does anybody know how I can do this?
I am also open to different ways of doing this, for example, a package other than rvest that might work better. I used rvest because it's the one I started learning with.
Upvotes: 3
Views: 1756
Reputation: 20463
One way would be to make a vector of all the urls you are interested in and then use sapply:
library(rvest)
# Build one url per season from 1970 through 2016
years <- 1970:2016
urls <- paste0("http://www.baseball-reference.com/teams/MIL/", years, ".shtml")
# head(urls)
# Scrape the first team-batting table from a single page
get_table <- function(url) {
  url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="div_team_batting"]/table[1]') %>%
    html_table()
}
results <- sapply(urls, get_table)
results should be a list of 47 data.frame objects, each named with the url (i.e., year) it represents. That is, results[1] corresponds to 1970, and results[47] corresponds to 2016.
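If you eventually want a single table rather than a list, one possible follow-up (not part of the answer above, and only a sketch that assumes the column names and types actually line up across seasons) is to rename the list elements by year and stack them with dplyr::bind_rows:
library(dplyr)
# Hypothetical extension: name the list by season and stack the pieces,
# adding a Year column taken from the list names. This assumes every
# season's table has compatible columns, which may not hold if the site
# changed its table layout over the years.
names(results) <- years
all_seasons <- bind_rows(results, .id = "Year")
# Or pull out a single season by name:
brewers_1970 <- results[["1970"]]
If the columns don't match across seasons, bind_rows will error, and cleaning each data frame individually before combining is the safer route.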
Upvotes: 7