readHTMLTable in R only bringing back first two tables from basketball-reference page

Question

I am trying to scrape the team stats page on basketball-reference.com, but readHTMLTable only brings back the top two tables.

My R code looks like this:

library(XML)
url <- "http://www.basketball-reference.com/leagues/NBA_2015.html"
teamPageTables <- readHTMLTable(url)

This returns a list of only two elements: the top two tables on the page. I expected a list containing all of the tables on the page.

I have also tried using rvest with the XPath of the table I want (the Miscellaneous Stats table), but had no luck there either.

Has BBR changed something to block scraping? I have even seen other posts about scraping the team page which indicated that the table the poster wanted was at index 16. I copied that code and still got nothing.

Any help would be greatly appreciated. Thanks,

Parfait · Accepted Answer

Because the other tables are wrapped in HTML comments, readHTMLTable() does not capture them. However, consider reading the URL text with readLines() and removing the comment tags (<!-- and -->), then parsing the document from there. It turns out there are 85 tables on the page! The code below extracts the 10 tables immediately viewable on screen:

library(XML)

# READ URL TEXT
url <- "http://www.basketball-reference.com/leagues/NBA_2015.html"
urltxt <- readLines(url)
# REMOVE COMMENT TAGS
urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
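
To illustrate why stripping the comment delimiters exposes the hidden tables, here is a minimal, self-contained sketch that runs without network access. The HTML fragment, the table ids, and the team values are invented for illustration and are not taken from the real page; only gsub(), htmlParse(), and readHTMLTable() from base R and the XML package are assumed.

```r
library(XML)

# Hypothetical page fragment: one visible table and one table
# hidden inside an HTML comment (all values are made up)
html <- paste(
  "<html><body>",
  "<table id='visible'><tr><th>Team</th><th>W</th></tr>",
  "<tr><td>AAA</td><td>10</td></tr></table>",
  "<!--",
  "<table id='hidden'><tr><th>Team</th><th>Pace</th></tr>",
  "<tr><td>AAA</td><td>95.1</td></tr></table>",
  "-->",
  "</body></html>",
  sep = "\n"
)

# Parsed as-is, the commented-out table is a comment node, not a <table>
before <- readHTMLTable(htmlParse(html, asText = TRUE))

# Remove the comment delimiters so the parser sees the hidden markup
uncommented <- gsub("-->", "", gsub("<!--", "", html, fixed = TRUE), fixed = TRUE)
after <- readHTMLTable(htmlParse(uncommented, asText = TRUE))

length(before)  # number of tables found before stripping
length(after)   # number of tables found after stripping
```

Applied to the real page, the same idea would presumably be to collapse urltxt into one string, pass it to htmlParse() with asText = TRUE, and then call readHTMLTable() on the result. Which list positions hold the tables of interest has to be checked against the live page, since its layout may have changed.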
