steelersfan13

Reputation: 47

R web scraping packages failing to read in all tables of url

I'm trying to scrape a number of tables from the following link: 'https://www.pro-football-reference.com/boxscores/201209050nyg.htm'. From what I can tell after trying a number of methods/packages, I think R is failing to read in the entire page. Here are a few attempts I've made:

library(RCurl)
library(XML)

url <- 'https://www.pro-football-reference.com/boxscores/201209050nyg.htm'
a <- getURL(url)
tabs <- readHTMLTable(a, stringsAsFactors = TRUE)

and

library(rvest)

x <- read_html(url)
y <- html_nodes(x, xpath = '//*[@id="div_home_snap_counts"]')

I've had success reading in the first two tables with both methods, but after that I can't read in any of the others, regardless of whether I use XPath or CSS selectors. Does anyone have any idea why I'm failing to read in these later tables?

Upvotes: 2

Views: 72

Answers (1)

QHarr

Reputation: 84465

If you use a browser like Chrome, you can go into settings and disable JavaScript. You will then see that only a few tables are present; the rest require JavaScript to run in order to load. Those tables are not being loaded, as displayed in the browser, when you use your current methods. Possible solutions are:

  1. Use a method like RSelenium, which allows JavaScript to run.
  2. Inspect the HTML of the page to see if the info is stored elsewhere and can be obtained from there. Sometimes info is retrieved from script tags, for example, where it is stored as a JSON/JavaScript object.
  3. Monitor network traffic while refreshing the page (F12 to open dev tools, then the Network tab) and see if you can find the source the additional content is loaded from. You may find other endpoints you can use.

Looking at the page, it seems that at least two of the missing tables (likely all of them) are actually stored in comments in the returned HTML, associated with divs having class placeholder, and that you need to either strip the comment marks or use a method that parses comments. Presumably, when JavaScript runs, these comments are converted to displayed content.

Here is an example of that pattern in the HTML (the original screenshot is not reproduced here):
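A schematic of the structure, using the snap-counts table as an illustration (this snippet is reconstructed to show the pattern, not copied verbatim from the page):

```html
<div id="all_home_snap_counts" class="table_wrapper">
  <div class="placeholder"></div>
  <!--
  <div class="table_container" id="div_home_snap_counts">
    <table class="stats_table" id="home_snap_counts">
      ...
    </table>
  </div>
  -->
</div>
```

The table markup exists in the response, but only inside the comment, which is why node-based selectors like `#div_home_snap_counts` find nothing.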

Building on this answer by @alistaire, one method is as follows (shown for a single example table, the Game Info table selected with `#game_info`):

library(rvest)

h <- read_html('https://www.pro-football-reference.com/boxscores/201209050nyg.htm')

df <- h %>%
  html_nodes(xpath = '//comment()') %>%  # grab every comment node
  html_text() %>%                        # get the raw text inside the comments
  paste(collapse = '') %>%               # combine into one string
  read_html() %>%                        # re-parse that string as HTML
  html_node('#game_info') %>%            # select the table of interest
  html_table()
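The same idea extends to recovering every commented-out table in one pass. Here is a sketch (untested against the live page, since the exact number of tables hidden in comments may vary), using `html_nodes('table')` and `html_table(fill = TRUE)` to tolerate ragged rows:

```r
library(rvest)

h <- read_html('https://www.pro-football-reference.com/boxscores/201209050nyg.htm')

# Re-parse the concatenated comment text, then pull every <table> found there
all_tabs <- h %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_nodes('table') %>%
  html_table(fill = TRUE)

length(all_tabs)  # number of tables recovered from the comments
```

You can then combine this list with the visible tables obtained from `h` directly.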

Upvotes: 2
