Reputation: 23
I'm trying to scrape data from the Extra Skater website into a data frame. From what I can tell from the HTML, the table rows carry different classes, and the page toggles between those classes to display different sets of rows. I'm only interested in the rows with this class:
<tr class="team-game-stats team-game-stats-5v5close hidden">
For example:
<tr class="team-game-stats team-game-stats-5v5close hidden">
<td class="hidden">5v5close</td>
<td><a href="/game/2013-01-19-maple-leafs-canadiens">2013-01-19: Maple Leafs 2 at Canadiens 1</a></td>
<td class="number-right">19.7</td>
<td class="number-right">0</td>
<td class="number-right">0</td>
<td class="number-right">14</td>
<td class="number-right">18</td>
<td class="number-right">43.8%</td>
<td class="number-right">11</td>
<td class="number-right">15</td>
<td class="number-right">42.3%</td>
<td class="number-right">8</td>
<td class="number-right">11</td>
<td class="number-right">42.1%</td>
<td class="number-right">0.0%</td>
<td class="number-right">100.0%</td>
</tr>
When I run the code:
library(RCurl)
library(XML)
theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
tb = readHTMLTable(theurl)
It returns a list with all the table rows stacked one on top of the other. I imagine that I have to use xpathSApply to have more precision, but I am unsure about the path argument. When I run the code:
library(RCurl)
library(XML)
theurl <- "http://www.extraskater.com/team/montreal-canadiens/2012/gamelog"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, useInternalNodes = TRUE)
# Extract table header and contents
results <- xpathSApply(pagetree, "//*/table[@class='team-game-stats team-game-stats-5v5close hidden']/tr/td", xmlValue)
results comes back as NULL.
Thanks for your time.
Upvotes: 2
Views: 354
Reputation: 2225
Could you just filter the data.frame rather than the HTML?
tb <- readHTMLTable(theurl, which=1)
table(tb$Situation)
     5v5 5v5close  5v5tied      all       ev       pp       sh
      48       48       48       48       48       48       48
subset(tb, Situation=="5v5close")
Upvotes: 0
Reputation: 121588
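Try this. In the HTML you posted the class is on the <tr> elements, not on the <table>, so match any element that carries it: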
xxpath = "//*[@class='team-game-stats team-game-stats-5v5close hidden']"
xpathApply(pagetree,xxpath,readHTMLList)
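If you want a data frame rather than a list, here is a minimal sketch of one way to get there. It runs against a small inline HTML string instead of the live page, so the markup is a trimmed stand-in: the "all" row and its 60.0 value are made up for illustration, and the column names Game and Value are placeholders (only Situation appears in the real table output above). It also uses contains() on the class attribute, which is my assumption about what is convenient here, not something the page requires.

```r
library(XML)

# Trimmed stand-in for the game log page. The "all" row and its 60.0 value
# are invented for illustration; the 5v5close row follows the question.
html <- '
<table>
  <tr class="team-game-stats team-game-stats-all">
    <td class="hidden">all</td><td>2013-01-19: Maple Leafs 2 at Canadiens 1</td><td class="number-right">60.0</td>
  </tr>
  <tr class="team-game-stats team-game-stats-5v5close hidden">
    <td class="hidden">5v5close</td><td>2013-01-19: Maple Leafs 2 at Canadiens 1</td><td class="number-right">19.7</td>
  </tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# The class is on <tr>, so select rows rather than tables.
rows <- getNodeSet(doc, "//tr[contains(@class, 'team-game-stats-5v5close')]")

# Collect the cell text of each matching row, then bind the rows together.
cells <- lapply(rows, function(r) xpathSApply(r, "./td", xmlValue))
close5v5 <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)

# Placeholder column names; Game and Value are hypothetical labels.
names(close5v5) <- c("Situation", "Game", "Value")
```

Every column comes back as character, so numeric columns would still need as.numeric(), and the percentage columns would need the "%" stripped first.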
Upvotes: 2