XML package in R - readHTMLTable and multiple row classes

Question

I'm trying to scrape data from this website, Extra Skater, into a data frame. From what I can tell from the HTML, the table rows carry several different classes, and the page toggles between them to show different sets of rows. I'm only interested in the rows with the class 'team-game-stats team-game-stats-5v5close hidden'.

For example:


    5v5close

    2013-01-19: Maple Leafs 2 at Canadiens 1

    19.7
    0
    0
    14    
    18
    43.8%
    11
    15
    42.3%
    8
    11
    42.1%
    0.0%
    100.0%

When I run the code:

library(RCurl)
library(XML)
theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
tb = readHTMLTable(theurl)

It returns a list in which the rows of every class are stacked one on top of the other. I imagine I have to use xpathSApply to be more selective, but I am unsure about the path argument. When I run the code:

library(RCurl)
library(XML)

theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE)

# Extract table header and contents
results <- xpathSApply(pagetree, "//*/table[@class='team-game-stats team-game-stats-5v5close hidden']/tr/td", xmlValue)

results comes back as NULL.

Thanks for your time.

agstudy · Accepted Answer

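Try this. The class appears to be set on the rows rather than on the table element, so drop the table step from the path and match the class directly: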

xxpath = "//*[@class='team-game-stats team-game-stats-5v5close hidden']"
xpathApply(pagetree,xxpath,readHTMLList)
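
If you want a data frame rather than a list, something along these lines might work. This is only a sketch: the inline HTML is a made-up stand-in for the real page (only the first row's values come from the question, and the class team-game-stats-other is hypothetical), and it assumes the class sits on the tr elements. It also swaps the exact class match for a contains() test, which should still match if the page adds or drops the 'hidden' token.

```r
library(XML)

# Made-up stand-in for the page: one table whose rows carry different classes.
# Only the first row's values come from the question; the rest are placeholders.
html <- '<table>
  <tr class="team-game-stats team-game-stats-5v5close hidden">
    <td>2013-01-19: Maple Leafs 2 at Canadiens 1</td><td>19.7</td><td>43.8%</td>
  </tr>
  <tr class="team-game-stats team-game-stats-other">
    <td>placeholder game</td><td>0</td><td>0%</td>
  </tr>
  <tr class="team-game-stats team-game-stats-5v5close hidden">
    <td>placeholder game</td><td>1</td><td>1%</td>
  </tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# Match rows whose class list contains the 5v5close token, whatever else is in it
xp <- "//tr[contains(concat(' ', normalize-space(@class), ' '), ' team-game-stats-5v5close ')]"
rows <- getNodeSet(doc, xp)

# One character vector of cell values per matching row
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))

# Bind the rows into a data frame; columns get the default names V1, V2, ...
df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
print(df)
```

Everything comes back as character, so the numeric and percentage columns would still need converting (for example with as.numeric after stripping the % sign).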
