Reputation: 1816
I am trying to scrape the data, using R, from this site: http://www.soccer24.com/kosovo/superliga/results/#
I can do the following:
library(rvest)
doc <- read_html("http://www.soccer24.com/kosovo/superliga/results/")
but I am stumped on how to actually get to the data. This is because the actual data on the website seems to be generated by JavaScript. What I can do is
html_text(doc)
but that gives a long blurb of weird text (which does include the data, but interspersed with odd code), and it's not at all clear how I would parse that.
What I want to extract is the match data (date, time, teams, result) for all of the matches. No other data is needed from this site.
Can anyone provide some hints as to how to extract that data from this site?
Upvotes: 10
Views: 2186
Reputation: 30465
Using Selenium with phantomjs:
library(RSelenium)
library(XML)  # provides htmlParse, readHTMLTable and removeNodes used below
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
remDr$open()
remDr$navigate(appURL)
If you want to press the "more data" button until it is no longer visible (at which point all matches are presumed to be showing):
webElem <- remDr$findElement("css", "#tournament-page-results-more a")
while(webElem$isElementDisplayed()[[1]]){
  webElem$clickElement()
  Sys.sleep(5)  # give the page time to load the next batch of matches
  webElem <- remDr$findElement("css", "#tournament-page-results-more a")
}
doc <- htmlParse(remDr$getPageSource()[[1]])
Remove unwanted round data and use XML::readHTMLTable for simplicity:
# Remove unwanted round rows. Sometimes there are end-of-season extra games,
# which are presented in a separate table.
invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])
appData <- readHTMLTable(doc, which = seq(length(doc["//table"])-1), stringsAsFactors = FALSE, trim = TRUE)
if(!is.data.frame(appData)){appData <- do.call(rbind, appData)}
row.names(appData) <- NULL
names(appData) <- c("blank", "Date", "hteam", "ateam", "score")
pJS$stop()
> head(appData)
  blank         Date           hteam            ateam score
1       01.04. 18:00     Ferronikeli          Ferizaj 4 : 0
2       01.04. 18:00          Istogu         Hajvalia 2 : 1
3       01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
4       01.04. 18:00       Prishtina          Drenica 3 : 0
5       31.03. 18:00       Besa Peje            Drita 1 : 0
6       31.03. 18:00       Trepca 89       Vellaznimi 2 : 0
> tail(appData)
    blank         Date            hteam     ateam score
115       17.08. 22:00        Besa Peje Trepca 89 3 : 3
116       17.08. 22:00      Ferronikeli  Hajvalia 2 : 5
117       17.08. 22:00 Trepca Mitrovice   Ferizaj 1 : 0
118       17.08. 22:00       Vellaznimi   Drenica 2 : 1
119       16.08. 22:00  Kosova Vushtrri     Drita 0 : 1
120       16.08. 22:00        Prishtina    Istogu 2 : 1
Carry out further formatting as needed.
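As one possible example of such formatting, here is a minimal base R sketch that drops the empty column and splits the Date and score columns into separate fields. It runs on a two-row sample copied from the output above, so it does not need Selenium. The new column names (date, time, hgoals, agoals) are my own choice, and it assumes every score has the form "h : a"; anything else (e.g. a postponed match) would come out as NA. The scraped text carries no year, so the date is left as a character string.

```r
# Two sample rows copied from the head(appData) output above.
appData <- data.frame(
  blank = c("", ""),
  Date  = c("01.04. 18:00", "31.03. 18:00"),
  hteam = c("Ferronikeli", "Besa Peje"),
  ateam = c("Ferizaj", "Drita"),
  score = c("4 : 0", "1 : 0"),
  stringsAsFactors = FALSE
)

# Split "dd.mm. HH:MM" on the single space into day-month and kick-off time.
dt <- strsplit(appData$Date, " ", fixed = TRUE)
appData$date <- vapply(dt, `[`, character(1), 1)
appData$time <- vapply(dt, `[`, character(1), 2)

# Split "h : a" on the colon and convert both sides to integer goal counts.
goals <- strsplit(appData$score, ":", fixed = TRUE)
appData$hgoals <- as.integer(trimws(vapply(goals, `[`, character(1), 1)))
appData$agoals <- as.integer(trimws(vapply(goals, `[`, character(1), 2)))

# Keep only the tidy columns (this also drops the empty "blank" column).
appData <- appData[, c("date", "time", "hteam", "ateam", "hgoals", "agoals")]
```

With the real scraped appData, skip the data.frame() construction and run the remaining lines unchanged.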
Upvotes: 12