frank

Reputation: 3608

Read multiple tables through scraping

I would like to read some tables in R from http://www.nhl.com/stats/player?navid=nav-sts-indiv#, but a single table only shows 30 rows per page, while there are more than 300 in total. Using XML and readHTMLTable, how can I extract and combine all of the tables into one large data frame that I can run my analysis on?

I used a loop to read through pages 1-11, but there are many different tables I am interested in, and it is not efficient to check every one manually to see how many pages and rows it has (a quick programmatic check is sketched after the code below).

    require(XML)
    url <- 'http://www.nhl.com/stats/player?fetchKey=20153ALLSASAll&viewName=summary&sort=points&gp=1&pg='
    a <- NULL
    for(i in 1:11){
        w <- paste(url, i, sep = '')
        b <- readHTMLTable(w)[[3]]
        a <- rbind(a, b)
    }
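
For reference, here is a quick check (just a sketch, reusing the url defined above) of what readHTMLTable returns for one page, which is how I can see which element is the stats table without eyeballing each view:

    # Inspect what readHTMLTable returns for the first page of a view
    tabs <- readHTMLTable(paste(url, 1, sep = ''))
    length(tabs)                                              # how many tables the page contains
    sapply(tabs, function(t) if (is.null(t)) 0 else nrow(t))  # rows in each; the 30-row one is the stats table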

Note that I only know there are 11 URLs because I checked manually.

Can anyone think of a way to automatically determine the total number of pages/rows so that I do not have to look it up manually? Maybe there is a function out there that I have not come across?

Upvotes: 1

Views: 131

Answers (1)

vikjam

Reputation: 550

You could scrape the number of pages and then throw that into your loop. Here's a quick extension of your scraper:

    require(XML)
    require(stringr)

    url <- 'http://www.nhl.com/stats/player?fetchKey=20153ALLSASAll&viewName=summary&sort=points&gp=1&pg='

    # Scrape the first page to get the number of pages
    w1 <- paste(url, 1, sep = '')

    # Parse using the XPath found by inspecting the page
    page_divs <- xpathApply(htmlParse(w1), '//*[@id="statsPage"]/div/div/a')

    # Extract the last paging link
    last_div  <- page_divs[length(page_divs)]

    # Extract its href, which points at the last page
    last_url  <- xmlGetAttr(last_div[[1]], "href")

    # Extract the maximum page number from the href
    max_page  <- as.numeric(str_extract(str_extract(last_url, "pg=\\d+"), "[[:digit:]]+"))

    a <- NULL
    for(i in 1:max_page) {
        w <- paste(url, i, sep = '')
        b <- readHTMLTable(w)[[3]]
        a <- rbind(a, b)
    }
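
If you'd rather not grow a with rbind inside the loop, here is a sketch of the same idea (the names tables and all_players are just illustrative, and it assumes the paging links keep the pg= query parameter) that binds everything in one call:

    require(XML)
    require(stringr)

    url <- 'http://www.nhl.com/stats/player?fetchKey=20153ALLSASAll&viewName=summary&sort=points&gp=1&pg='

    # Page count from the last paging link, as above
    page_divs <- xpathApply(htmlParse(paste(url, 1, sep = '')), '//*[@id="statsPage"]/div/div/a')
    last_url  <- xmlGetAttr(page_divs[[length(page_divs)]], "href")
    max_page  <- as.numeric(str_match(last_url, "pg=(\\d+)")[, 2])

    # Read the third table from every page and bind them once
    tables      <- lapply(seq_len(max_page), function(i) readHTMLTable(paste(url, i, sep = ''))[[3]])
    all_players <- do.call(rbind, tables)
    nrow(all_players)   # should be well over 300

Binding once with do.call(rbind, ...) also avoids repeatedly copying the growing data frame on every iteration.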

Upvotes: 1
