Reputation: 47
I'm trying to scrape a number of tables from the following link: 'https://www.pro-football-reference.com/boxscores/201209050nyg.htm'. From what I can tell from trying a number of methods/packages, I think R is failing to read in the entire page. Here are a few attempts I've made:
library(RCurl)
library(XML)

a <- getURL(url)
tabs <- readHTMLTable(a, stringsAsFactors = TRUE)
and
library(rvest)

x <- read_html(url)
y <- html_nodes(x, xpath = '//*[@id="div_home_snap_counts"]')
I've had success reading in the first two tables with both methods but after that I can't read in any others regardless of whether I use xpath or css. Does anyone have any idea why I'm failing to read in these later tables?
Upvotes: 2
Views: 72
Reputation: 84465
If you use a browser like Chrome you can go into settings and disable javascript. You will then see that only a few tables are present; the rest require javascript to run in order to load, so with your current method those tables are not loaded as they are when displayed in the browser. Possible solutions are:

1. use a method that allows javascript to run, so the tables load as they do in the browser, or
2. check whether the data is present elsewhere in the returned html, for example inside script tags, where it is stored as a json/javascript object.

Looking at the page it seems that at least two of those missing tables (likely all) are actually stored in comments in the returned html, associated with divs having class placeholder; you need either to remove the comment marks, or to use a method that allows for parsing comments. Presumably, when javascript runs, these comments are converted to displayed content.
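As a quick check, a minimal sketch along these lines compares the number of table elements in the javascript-free DOM with the number of tables that appear inside the commented-out markup (the object name comments is just illustrative):

library(rvest)

h <- read_html('https://www.pro-football-reference.com/boxscores/201209050nyg.htm')

# tables present in the returned html without javascript
length(html_nodes(h, 'table'))

# comment nodes whose text contains table markup (the hidden tables)
comments <- html_text(html_nodes(h, xpath = '//comment()'))
sum(grepl('<table', comments, fixed = TRUE))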
Looking at this answer by @alistaire, one method is as follows (shown for a single example table, selected below by its id game_info):
library(rvest)

h <- read_html('https://www.pro-football-reference.com/boxscores/201209050nyg.htm')

df <- h %>%
  html_nodes(xpath = '//comment()') %>%  # every comment node in the page
  html_text() %>%                        # the commented-out markup as text
  paste(collapse = '') %>%               # combined into a single string
  read_html() %>%                        # re-parsed as html
  html_node('#game_info') %>%            # the hidden table, selected by id
  html_table()                           # converted to a data frame
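The same idea extends to pulling all of the hidden tables at once: parse every comment, re-read the combined markup, and extract each table it contains. A rough sketch under the same assumptions (the names visible_tables, hidden_tables and all_tables are just for illustration; with older versions of rvest you may need html_table(fill = TRUE) for irregular tables):

library(rvest)

h <- read_html('https://www.pro-football-reference.com/boxscores/201209050nyg.htm')

# tables already visible without javascript
visible_tables <- h %>% html_nodes('table') %>% html_table()

# tables hidden inside html comments
hidden_tables <- h %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_nodes('table') %>%
  html_table()

all_tables <- c(visible_tables, hidden_tables)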
Upvotes: 2