user3265009

Reputation: 23

XML package in R - readHTMLTable and multiple row classes

I'm trying to scrape data from the Extra Skater website into a data frame. Looking at the HTML, the table rows carry different classes, and the page toggles between those classes to show different sets of rows. I'm only interested in the rows with this class:

<tr class="team-game-stats team-game-stats-5v5close hidden">

For example:

<tr class="team-game-stats team-game-stats-5v5close hidden">
    <td class="hidden">5v5close</td>

    <td><a href="/game/2013-01-19-maple-leafs-canadiens">2013-01-19: Maple Leafs 2 at Canadiens 1</a></td>

    <td class="number-right">19.7</td>
    <td class="number-right">0</td>
    <td class="number-right">0</td>
    <td class="number-right">14</td>    
    <td class="number-right">18</td>
    <td class="number-right">43.8%</td>
    <td class="number-right">11</td>
    <td class="number-right">15</td>
    <td class="number-right">42.3%</td>
    <td class="number-right">8</td>
    <td class="number-right">11</td>
    <td class="number-right">42.1%</td>
    <td class="number-right">0.0%</td>
    <td class="number-right">100.0%</td>

</tr>

When I run the code:

library(RCurl)
library(XML)
theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
tb = readHTMLTable(theurl)

It returns a list in which the rows for every situation are stacked on top of one another. I suspect I need xpathSApply for more precision, but I'm unsure what to pass as the path argument. When I run the code:

library(RCurl)
library(XML)

theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE)

# Extract table header and contents
results <- xpathSApply(pagetree, "//*/table[@class='team-game-stats team-game-stats-5v5close hidden']/tr/td", xmlValue)

the results come back as NULL.

Thanks for your time.

Upvotes: 2

Views: 354

Answers (2)

Chris S.

Reputation: 2225

Could you just filter the data.frame rather than the HTML?

tb <- readHTMLTable(theurl, which = 1)
table(tb$Situation)
#      5v5 5v5close  5v5tied      all       ev       pp       sh 
#       48       48       48       48       48       48       48 
subset(tb, Situation == "5v5close")
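
For anyone who wants to try the filtering step without fetching the page, here is a minimal sketch on a toy data frame. Only the Situation column comes from the output above; the Game and TOI columns and all of the values are made up for illustration.

```r
# Toy stand-in for the data frame returned by readHTMLTable(theurl, which = 1).
# Column names other than Situation, and all values, are hypothetical.
tb <- data.frame(
  Situation = c("all", "5v5close", "5v5", "5v5close"),
  Game      = c("Game A", "Game A", "Game B", "Game B"),
  TOI       = c("60.0", "19.7", "45.2", "22.1"),
  stringsAsFactors = FALSE
)

# Keep only the 5v5close rows and drop the now-constant Situation column.
close5v5 <- subset(tb, Situation == "5v5close", select = -Situation)

# Reset row names so they run 1..n instead of keeping the original positions.
rownames(close5v5) <- NULL
```

The values read from an HTML table arrive as text, so numeric columns would probably still need as.numeric() before any arithmetic.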

Upvotes: 0

agstudy

Reputation: 121588

Try this:

xxpath = "//*[@class='team-game-stats team-game-stats-5v5close hidden']"
xpathApply(pagetree,xxpath,readHTMLList)
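
Judging by the snippet in the question, the class attribute sits on the tr elements rather than on the table, which is probably why the original table[@class=...] path matched nothing. As a rough sketch of turning the matched rows into a data frame, the example below runs against a small inline HTML string rather than the live page. The markup is a hypothetical stand-in modeled on the question's snippet, and matching a single class token is my assumption about what is wanted, not something the page requires.

```r
library(XML)

# Hypothetical stand-in for the page, modeled on the snippet in the question.
html <- '
<table>
  <tr class="team-game-stats team-game-stats-all">
    <td class="hidden">all</td><td>Game A</td><td class="number-right">60.0</td>
  </tr>
  <tr class="team-game-stats team-game-stats-5v5close hidden">
    <td class="hidden">5v5close</td><td>Game A</td><td class="number-right">19.7</td>
  </tr>
  <tr class="team-game-stats team-game-stats-5v5close hidden">
    <td class="hidden">5v5close</td><td>Game B</td><td class="number-right">22.1</td>
  </tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# Select <tr> elements whose class list contains the 5v5close token,
# regardless of which other classes (such as "hidden") are also present.
xp <- "//tr[contains(concat(' ', normalize-space(@class), ' '), ' team-game-stats-5v5close ')]"
rows <- getNodeSet(doc, xp)

# Collect the cell text of each matched row, then bind the rows together.
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))
df <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
```

Matching on one token instead of the full attribute string should keep working if the page later drops or reorders the other classes. The columns come out unnamed (V1, V2, ...), so the header names would have to be assigned separately.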

Upvotes: 2
