Reputation: 3608
I would like to read some tables into R from http://www.nhl.com/stats/player?navid=nav-sts-indiv#, but a single table only shows 30 rows when there are more than 300 in total. Using XML and readHTMLTable, how can I extract and combine all of the tables into one large data frame that I can analyze?
I used a loop to read through pages 1-11, but there are many different tables I am interested in, and I don't think it is efficient to check every one manually to see how many tables/rows there are in total.
require(XML)

url <- 'http://www.nhl.com/stats/player?fetchKey=20153ALLSASAll&viewName=summary&sort=points&gp=1&pg='

a <- NULL
for (i in 1:11) {
  w <- paste(url, i, sep = '')
  # The player summary table is the third table on each page
  b <- readHTMLTable(w)[[3]]
  a <- rbind(a, b)
}
Note that there are 11 URLs, which I only know because I looked it up manually.
Is there a way to automatically determine the total number of pages/rows so I don't have to look it up by hand? Maybe there is a function out there that I have not come across?
Upvotes: 1
Views: 131
Reputation: 550
You could scrape the number of pages and then throw that into your loop. Here's a quick extension of your scraper:
require(XML)
require(stringr)

url <- 'http://www.nhl.com/stats/player?fetchKey=20153ALLSASAll&viewName=summary&sort=points&gp=1&pg='

# Scrape the first page to get the number of pages
w1 <- paste(url, 1, sep = '')
# Grab the pager links using the XPath found by inspecting the page
page_links <- xpathApply(htmlParse(w1), '//*[@id="statsPage"]/div/div/a')
# The last link points at the last page
last_link <- page_links[[length(page_links)]]
# Extract its href
last_url <- xmlGetAttr(last_link, "href")
# Pull the page number out of the "pg=" parameter and convert it to an integer
max_page <- as.integer(str_extract(str_extract(last_url, "pg=\\d+"), "\\d+"))

a <- NULL
for (i in 1:max_page) {
  w <- paste(url, i, sep = '')
  b <- readHTMLTable(w)[[3]]
  a <- rbind(a, b)
}
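If you'd rather not depend on the pager markup at all, here's a minimal sketch of an alternative: keep requesting pages and stop as soon as one comes back empty. It assumes the same URL pattern and third-table position as above, and it also assumes (my guess, not something verified against the site) that paging past the last page returns an empty table or fails to parse:

require(XML)

url <- 'http://www.nhl.com/stats/player?fetchKey=20153ALLSASAll&viewName=summary&sort=points&gp=1&pg='

a <- NULL
i <- 1
repeat {
  # Read page i; as above, the player summary is assumed to be the third table
  tbl <- tryCatch(readHTMLTable(paste(url, i, sep = ''))[[3]],
                  error = function(e) NULL)
  # Stop once a page fails to parse or comes back with no rows
  if (is.null(tbl) || nrow(tbl) == 0) break
  a <- rbind(a, tbl)
  i <- i + 1
}
dim(a)  # quick sanity check on the combined data frame

Either way, dim(a) should come out around the 300-plus rows you expect.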
Upvotes: 1