Peter Verbeet

Reputation: 1816

Stumped on how to scrape the data from this site (using R)

I am trying to scrape the data, using R, from this site: http://www.soccer24.com/kosovo/superliga/results/#

I can do the following:

library(rvest)
doc <- html("http://www.soccer24.com/kosovo/superliga/results/")

but am stumped on how to actually get to the data. This is because the actual data on the website seems to be generated by JavaScript. What I can do is

html_text(doc)

but that gives a long blob of odd text (which does include the data, but interspersed with code), and it's not at all clear how I would parse that.

What I want to extract is the match data (date, time, teams, result) for all of the matches. No other data is needed from this site.

Can anyone provide some hints as to how to extract that data from this site?

Upvotes: 10

Views: 2186

Answers (1)

jdharrison

Reputation: 30465

Using Selenium with phantomjs

library(RSelenium)
library(XML)  # provides htmlParse, readHTMLTable and removeNodes used below
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
remDr$open()
remDr$navigate(appURL)

If you want to keep pressing the "more data" button until it is no longer visible (at which point all matches are presumably showing):

webElem <- remDr$findElement("css", "#tournament-page-results-more a")
while(webElem$isElementDisplayed()[[1]]){
  webElem$clickElement()
  Sys.sleep(5)
  webElem <- remDr$findElement("css", "#tournament-page-results-more a")
}
doc <- htmlParse(remDr$getPageSource()[[1]])

Remove the unwanted round data and use XML::readHTMLTable for simplicity:

# Remove the unwanted round rows from the HTML. Sometimes there are extra
# end-of-season games; these are presented in a separate table.
invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])
appData <- readHTMLTable(doc, which = seq(length(doc["//table"])-1), stringsAsFactors = FALSE, trim = TRUE)
if(!is.data.frame(appData)){appData <- do.call(rbind, appData)}
row.names(appData) <- NULL
names(appData) <- c("blank", "Date", "hteam", "ateam", "score")
pJS$stop()
> head(appData)
  blank         Date           hteam            ateam score
1       01.04. 18:00     Ferronikeli          Ferizaj 4 : 0
2       01.04. 18:00          Istogu         Hajvalia 2 : 1
3       01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
4       01.04. 18:00       Prishtina          Drenica 3 : 0
5       31.03. 18:00       Besa Peje            Drita 1 : 0
6       31.03. 18:00       Trepca 89       Vellaznimi 2 : 0

> tail(appData)
    blank         Date            hteam     ateam score
115       17.08. 22:00        Besa Peje Trepca 89 3 : 3
116       17.08. 22:00      Ferronikeli  Hajvalia 2 : 5
117       17.08. 22:00 Trepca Mitrovice   Ferizaj 1 : 0
118       17.08. 22:00       Vellaznimi   Drenica 2 : 1
119       16.08. 22:00  Kosova Vushtrri     Drita 0 : 1
120       16.08. 22:00        Prishtina    Istogu 2 : 1

Carry out further formatting as needed.
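As one possible version of that formatting, here is a minimal base R sketch that splits the combined Date column into a day and a time, and the score into home and away goals, which is roughly the shape asked for in the question. It assumes the column layout shown in the output above. The sample rows are copied from that output so the snippet runs without a browser, and the new column names (day, time, hgoals, agoals) are only illustrative. The scraped string carries no year, so the day is left as text.

```r
# Sample rows copied from the head(appData) output above, so this runs offline.
appData <- data.frame(
  blank = c("", "", ""),
  Date  = c("01.04. 18:00", "01.04. 18:00", "31.03. 18:00"),
  hteam = c("Ferronikeli", "Istogu", "Besa Peje"),
  ateam = c("Ferizaj", "Hajvalia", "Drita"),
  score = c("4 : 0", "2 : 1", "1 : 0"),
  stringsAsFactors = FALSE
)

# Drop the empty first column.
appData$blank <- NULL

# Split "01.04. 18:00" on whitespace into a day part and a time part.
dateParts <- strsplit(appData$Date, "\\s+")
appData$day  <- vapply(dateParts, `[`, character(1), 1)
appData$time <- vapply(dateParts, `[`, character(1), 2)

# Split "4 : 0" on the colon into integer home and away goals.
scoreParts <- strsplit(appData$score, "\\s*:\\s*")
appData$hgoals <- as.integer(vapply(scoreParts, `[`, character(1), 1))
appData$agoals <- as.integer(vapply(scoreParts, `[`, character(1), 2))

# Keep only the columns of interest, in a convenient order.
matches <- appData[, c("day", "time", "hteam", "ateam", "hgoals", "agoals")]
print(matches)
```

Scores that are not plain "x : y" strings (for example postponed matches, if the site lists any) would come out as NA here and would need separate handling.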

Upvotes: 12
