Reputation: 11
I'm trying to scrape a table from a website, but I can't get it to work. I've done this numerous times before and it always worked, but this time the table seems to be generated by some sort of JavaScript, and the parsing doesn't work at all. Can someone help me?
The page is here.
I already tried the usual:
readHTMLTable(getNodeSet(doc, "//table[@id='live-player-home-offensive-grid']")[[1]], as.data.frame=TRUE, header=FALSE)
# or
xpathSApply(pagetree, "//*/table[@id='live-player-home-offensive-grid']", xmlValue)
Upvotes: 1
Views: 802
Reputation: 32401
The problem is that the data is not in the table but in the JavaScript code: it is only put into the table when the page is rendered in your browser.
I do not see a clean way of extracting it, short of using JavaScript tools or web browser controllers (Zombie.js, CasperJS, PhantomJS, Selenium).
The following reads the HTML page as a string and looks for the definition of the initialData variable, which apparently contains the data. It returns the data in the same hard-to-use format: a list of lists of lists of lists of lists of lists of lists.
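library(RCurl)
library(RJSONIO)
url <- "http://www.whoscored.com/Matches/411429/LiveStatistics/England-Premier-League-2010-2011-Fulham-Arsenal"
# Download the raw page source as a single string
html <- getURL(url)
# Keep only the text assigned to initialData, up to the first semicolon
initial_data <- gsub("^.*?initialData = (.*?);.*", "\\1", html)
# JSON requires double quotes, so replace the single quotes
initial_data <- gsub("'", '"', initial_data)
# Parse the JSON text into nested R lists
data <- fromJSON( initial_data )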
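To see the idea without hitting the live site, here is a minimal offline sketch of the same extraction on a made-up page source. The html string is a hypothetical, heavily simplified stand-in for the real page, and the use of sub with perl = TRUE and (?s) (so that . also matches newlines) is my own variation, not part of the code above.

```r
library(RJSONIO)

# Hypothetical stand-in for the page source (NOT the real whoscored.com markup)
html <- "<html><script>var initialData = [[1, 'Fulham'], [2, 'Arsenal']]; var other = 0;</script></html>"

# Capture the text between "initialData = " and the first semicolon.
# (?s) lets "." match newlines too, which matters for multi-line pages.
initial_data <- sub("(?s)^.*?initialData = (.*?);.*", "\\1", html, perl = TRUE)

# JSON requires double quotes, so swap the single quotes
initial_data <- gsub("'", '"', initial_data)

# Parse the JSON text into nested R objects
data <- fromJSON(initial_data)

# Inspect the nesting before indexing into it
str(data, max.level = 2)
```

On the real data, str with a small max.level is probably the quickest way to find which level of the nested lists holds the player rows. Stopping at the first semicolon would truncate the data if any string value contained one.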
Upvotes: 1