CoolGuyHasChillDay

Reputation: 747

R: Scraping multiple tables in URL

I'm learning how to scrape information from websites using httr and XML in R. I'm getting it to work just fine for pages with only a few tables, but I can't figure it out for pages with many tables. Using the following page from pro-football-reference as an example: https://www.pro-football-reference.com/boxscores/201609110atl.htm

# To get just the boxscore by quarter, which is the first table:
library(httr)
library(XML)

url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"
resp = GET(url)
SnapTable = readHTMLTable(rawToChar(resp$content), stringsAsFactors = FALSE)[[1]]

# Return the number of tables:
AllTables = readHTMLTable(rawToChar(resp$content), stringsAsFactors = FALSE)
length(AllTables)
[1] 2

So I'm able to scrape info, but for some reason I can only capture the top two tables out of the 20+ on the page. For practice, I'm trying to get the "Starters" tables and the "Officials" tables.

Is my inability to get the other tables a matter of the website's setup or incorrect code?

Upvotes: 0

Views: 649

Answers (1)

Christian

Reputation: 359

If it comes down to web scraping in R, make intensive use of the rvest package.

While getting the HTML the way you did is perfectly fine, rvest works with CSS selectors, and SelectorGadget helps you find a styling pattern for a particular table that is hopefully unique. That way you can extract exactly the tables you are looking for instead of relying on coincidence.
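For example, here is a minimal sketch of the CSS selector route. The generic "table" selector grabs every table the parser can see; a SelectorGadget-derived selector such as "#scoring" (an assumption about this page's markup, so verify it with the gadget first) would narrow it down to one table:

library(rvest)

page = read_html("https://www.pro-football-reference.com/boxscores/201609080den.htm")

# "table" matches every <table> node the parser can see; a
# SelectorGadget-derived selector (e.g. "#scoring", unverified here)
# would target one specific table instead
tables = page %>%
    html_nodes("table") %>%
    html_table(fill = TRUE)
length(tables)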

To get you started, read the rvest vignette for more detailed information.

# install.packages("rvest")
library(rvest)
library(magrittr)

# Store the page URL
fb_url = "https://www.pro-football-reference.com/boxscores/201609080den.htm"

# Read the page, locate the linescore table by its XPath, and parse it
# into a data frame
linescore = fb_url %>%
    read_html() %>%
    html_node(xpath = '//*[@id="content"]/div[3]/table') %>%
    html_table()
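
One caveat worth knowing: on this particular site most of the 20+ tables (including "Starters" and "Officials") appear to be wrapped in HTML comments, which is why DOM-walking parsers such as readHTMLTable only ever report the first two. Here is a hedged sketch of one common workaround that re-parses the comment text as HTML; treat it as an illustration rather than something guaranteed against future changes to the page:

library(rvest)
library(xml2)

page = read_html("https://www.pro-football-reference.com/boxscores/201609080den.htm")

# The hidden tables live inside <!-- ... --> comment nodes. Extract the
# comments, paste their text back together, and re-parse the result as
# HTML so html_table() can see the tables.
comments = xml_find_all(page, "//comment()")
hidden = read_html(paste(xml_text(comments), collapse = ""))
hidden_tables = hidden %>%
    html_nodes("table") %>%
    html_table(fill = TRUE)
length(hidden_tables)

From there you can pick out the "Starters" and "Officials" tables by position or caption.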

Hope this helps.

Upvotes: 1
