user2923574

Reputation: 11

Scraping HTML (or JavaScript) Table

I'm trying to scrape a table from a website, but can't get it to work. I've done this many times before and it always worked, but this time the table seems to be generated by some sort of JavaScript, and the parsing doesn't work at all. Can someone help me?

The page is here.

I already tried the usual:

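library(XML)
# doc and pagetree are assumed to hold the page parsed with htmlParse()
readHTMLTable(getNodeSet(doc, "//table[@id='live-player-home-offensive-grid']")[[1]], as.data.frame=TRUE, header=FALSE)
# or
xpathSApply(pagetree, "//*/table[@id='live-player-home-offensive-grid']", xmlValue)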

Upvotes: 1

Views: 802

Answers (1)

Vincent Zoonekynd

Reputation: 32401

The problem is that the data is not in the table but in the JavaScript code: it is only put into the table when the page is rendered in your browser.

I do not see a clean way of extracting it, short of using JavaScript tools or web browser controllers (Zombie.js, CasperJS, PhantomJS, Selenium).

The following reads the HTML page as a string and looks for the definition of the initialData variable, which apparently contains the data. It returns the data in the same hard-to-use format, a list of lists of lists of lists of lists of lists of lists.

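library(RCurl)    # getURL()
library(RJSONIO)  # fromJSON()
url <- "http://www.whoscored.com/Matches/411429/LiveStatistics/England-Premier-League-2010-2011-Fulham-Arsenal"
# Download the raw HTML as a single string
html <- getURL(url)
# Keep only the text assigned to initialData, up to the first semicolon
initial_data <- gsub("^.*?initialData = (.*?);.*", "\\1", html)
# JSON requires double quotes, so replace the single quotes
initial_data <- gsub("'", '"', initial_data)
# Parse the JSON text into nested R lists
data <- fromJSON( initial_data )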
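
Once the data is parsed, the nested lists still have to be turned into something tabular. The following is only a minimal sketch of one way to do that in base R. The layout of the real initialData is not known here, so the toy_data object, its player names and statistics, and the player_row helper are all invented for illustration; the real structure will likely differ (and fromJSON may collapse some inner lists into vectors), so the indices would need adjusting after inspecting it with str().

```r
# Hypothetical stand-in for the parsed result: names, values and nesting
# are invented, not taken from the real page.
toy_data <- list(
  list(
    list("Player A", list(list("rating", list(7.1)), list("shots", list(2)))),
    list("Player B", list(list("rating", list(6.4)), list("shots", list(0))))
  )
)

# Inspect only the top levels before deciding how to index into the lists
str(toy_data, max.level = 3)

# Turn one (hypothetical) player entry into a one-row data frame
player_row <- function(p) {
  stats <- p[[2]]
  # Each stat is list(name, list(value)): pull out the value...
  values <- lapply(stats, function(s) s[[2]][[1]])
  # ...and use the stat name as the column name
  names(values) <- vapply(stats, function(s) s[[1]], character(1))
  data.frame(name = p[[1]], values, stringsAsFactors = FALSE)
}

# Stack the one-row data frames into a single table
players <- do.call(rbind, lapply(toy_data[[1]], player_row))
print(players)
```

Calling str() with max.level keeps the output readable when the nesting is seven levels deep; increase it one level at a time until the player records become visible.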

Upvotes: 1
