Reputation: 3
I am trying to scrape the team stats page from basketball-reference.com, but when I use readHTMLTable it only brings back the top two tables.
My R code looks like this:
library(XML)
url = "http://www.basketball-reference.com/leagues/NBA_2015.html"
teamPageTables = readHTMLTable(url)
This returns a list of only 2: the top two tables on the page. I expected a list with all of the tables from the page.
I have also tried using rvest with the XPath of the table I want (the Miscellaneous Stats table), but with no luck there either.
Has BBR changed something to block scraping? I have even seen other posts about scraping the team site that indicated the wanted table was at index 16. I copied that code and still got nothing.
Any help would be greatly appreciated. Thanks,
Upvotes: 0
Views: 562
Reputation: 107687
Because the other tables are inside HTML comments, readHTMLTable() does not capture them. However, consider reading the URL text with readLines and then removing the comment tags <!-- and -->; from there, parse the document accordingly. It turns out there are 85 tables on the page! The code below extracts the 10 tables immediately viewable on screen:
library(XML)
# READ URL TEXT
url <- "http://www.basketball-reference.com/leagues/NBA_2015.html"
urltxt <- readLines(url)
# REMOVE COMMENT TAGS
urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
# PARSE UNCOMMENTED TEXT
doc <- htmlParse(urltxt)
# RETRIEVE ALL <table> TAGS
tables <- xpathApply(doc, "//table")
# LIST OF DATAFRAMES
teamPageTables <- lapply(tables[c(1:2,19:26)], function(i) readHTMLTable(i))
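Picking tables by position (1:2, 19:26) breaks as soon as the page layout shifts. A minimal sketch of the same uncommenting idea that selects a table by its id instead is shown here. The inline `html` vector is a made-up stand-in for the downloaded page so the example runs offline, and the id `misc_stats` is assumed to match the real page's markup at the time of the question.

```r
library(XML)

# Hypothetical stand-in for readLines(url): one visible table, one commented out
html <- c(
  "<html><body>",
  "<table id='visible'><tr><th>Team</th></tr><tr><td>A</td></tr></table>",
  "<!--",
  "<table id='misc_stats'><tr><th>Team</th><th>Pace</th></tr>",
  "<tr><td>A</td><td>95.1</td></tr></table>",
  "-->",
  "</body></html>"
)

# Strip the comment markers so the hidden table becomes ordinary markup
uncommented <- gsub("<!--|-->", "", html)

# Parse the uncommented text as a single HTML string
doc <- htmlParse(paste(uncommented, collapse = "\n"), asText = TRUE)

# Select the table by id (assumed name) rather than by list position
node <- getNodeSet(doc, "//table[@id='misc_stats']")[[1]]
misc_stats <- readHTMLTable(node, stringsAsFactors = FALSE)
```

With the real page, `html` would be replaced by `readLines(url)`; whether the id still exists there would need to be checked against the current page source.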
Upvotes: 5
Reputation: 4378
This web page has only two valid HTML tables. The other tables are within the page as HTML comments, perhaps to be parsed by some JavaScript. You could perhaps try to parse these comments.
The code below finds the two valid tables and writes the raw HTML to a file. Open bb.html in a text editor and notice that many tables are within HTML comments.
library(rvest)
url <- "http://www.basketball-reference.com/leagues/NBA_2015.html"
page <- read_html(url)
# there are two valid tables - get them with css id's
team_stats_per_game <- html_node(page, "#team-stats-per_game")
divs_standings_E <- html_nodes(page, "#divs_standings_E")
# look at the actual page text - open bb.html in a text editor
text <- readLines(url)
writeLines(text, "bb.html")
The commented tables look like this:
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_misc_stats">
<table class="sortable stats_table" id="misc_stats" data-cols-to-freeze=2><caption>Miscellaneous Stats Table</caption>
etc.
-->
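One way to parse these comments with rvest is to select the comment nodes with the XPath `//comment()`, re-parse their text as HTML, and then look up the table by id. The sketch below is an illustration, not a tested scraper for the live site: the inline `html` string is a made-up miniature of the structure shown above so that it runs offline, and the `misc_stats` id is taken from that excerpt.

```r
library(rvest)

# Hypothetical miniature of the page: one visible table, one inside a comment
html <- "<html><body>
<table id='visible'><tr><th>Team</th></tr><tr><td>A</td></tr></table>
<div class='placeholder'></div>
<!--
<div class='table_outer_container'>
<table id='misc_stats'><tr><th>Team</th><th>Pace</th></tr>
<tr><td>A</td><td>95.1</td></tr></table>
</div>
-->
</body></html>"

page <- read_html(html)

# Pull out the text of every comment node in the document
comment_text <- html_text(html_nodes(page, xpath = "//comment()"))

# Re-parse the comment bodies as HTML and fetch the table by its id
hidden <- read_html(paste(comment_text, collapse = "\n"))
misc_stats <- html_table(html_node(hidden, "#misc_stats"))
```

For the real page, `read_html(url)` would take the place of `read_html(html)`. Compared with deleting the `<!--` and `-->` markers from the raw text, this leaves the original document untouched and only parses what was actually inside comments.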
Upvotes: 0