Get messy scraped data into data.frame in R

Question

I am trying to scrape a particular part of a website that looks like a table but, unfortunately, isn't one.

I use this code:

library(rvest)

htmldoc <- read_html("http://www.wettportal.com/quotenvergleich/valuebets/")

data <- htmldoc %>% 
  html_node(xpath='//*[(@id = "datagrid_content")]') %>%
  html_text()

# alternative css selector: "#datagrid_content"

and get this kind of output:

Fussball | Schweden | Cup














08.06.2016
Tipp
VQ
Buchmacher
100%
Profit




19:00
Huddinge IF - Enskede IK
1 (DNB)
1.73
Coral
1.50
45.17%


19:00
Huddinge IF - Enskede IK
1
2.25
Coral
1.93
35.00%

As you can see, it is really messy and so far I have not been able to get it neatly into a data.frame.

Does anyone have an idea of how to either

select the object differently in order to obtain cleaner output from the start (preferred), or
clean the data so that it fits into a data.frame with columns like this: Sport | Country | Competition | Date | Time | Team1 | Team2 ... ?

Thank you.

Tomas H · Accepted Answer

There are a few things that make this a bit complicated. I use a different approach for web scraping, but the code below should help you get started.

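library(RCurl)
library(XML)
library(stringr)
library(tidyr)

# Download and parse the page
url <- "http://www.wettportal.com/quotenvergleich/valuebets/"
url2 <- getURL(url)
parsed <- htmlParse(url2, encoding = "UTF-8")

# Block headers such as "Fussball | Schweden | Cup", and the date of each block
info1 <- xpathSApply(parsed, "//div[@id='datagrid_content']//h2/span[1]", xmlValue)
date <- xpathSApply(parsed, "//th/time", xmlValue)

# Split each header into Sport, Country and Competition
# (" . " is a regex: any single character with a space on either side)
df <- data.frame(matrix(unlist(str_split(info1, " . ", n = 3)), nrow = length(info1), byrow = TRUE))
colnames(df) <- c("Sport", "Country", "Competition")
df <- cbind(df, date)

# One entry per odds row: kick-off time and the "Team1 - Team2" label
time <- xpathSApply(parsed, "//div[@id='datagrid_content']//tbody/tr/td[1]", xmlValue)
teams <- xpathSApply(parsed, "//div[@id='datagrid_content']//a/span", xmlValue)

# Give consecutive rows with the same team label the same ID
ID <- 1
for (i in seq_along(teams)[-1]) {
    if (teams[i] == teams[i - 1]) {
        x <- max(ID, na.rm = TRUE)
    } else {
        x <- max(ID, na.rm = TRUE) + 1
    }
    ID <- c(ID, x)
}
df2 <- cbind(teams, ID, time)
df$ID <- 1:nrow(df)

# Join the row-level data to the header-level data on ID.
# Note: this pairs the k-th fixture with the k-th header,
# so it assumes one fixture per header block.
final <- merge(df2, df)

# Split the team label and reorder the columns
final <- separate(final, col = teams, into = c("team1", "team2"), sep = " - ")
final <- final[, c(5:8, 4, 2, 3, 1)]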
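
If you want to try the string handling without hitting the website, here is a minimal base-R sketch on toy input. It is not the code above, just an equivalent illustration: the header and the first two team labels are copied from the sample output in the question, the third label ("Team A - Team B") is made up, and the names parts, changed, team_mat and toy are only for this example. It uses strsplit() with fixed = TRUE in place of str_split() and separate(), and cumsum() in place of the loop that builds ID.

```r
# Toy inputs: header and first two labels come from the question's sample
# output; "Team A - Team B" is a made-up third row for illustration.
header <- "Fussball | Schweden | Cup"
teams <- c("Huddinge IF - Enskede IK",
           "Huddinge IF - Enskede IK",
           "Team A - Team B")

# Split the header on the literal " | " (fixed = TRUE disables regex matching)
parts <- strsplit(header, " | ", fixed = TRUE)[[1]]
names(parts) <- c("Sport", "Country", "Competition")

# Run ID: the first row starts run 1, and the counter goes up by one
# every time a label differs from the label on the previous row
changed <- c(TRUE, teams[-1] != teams[-length(teams)])
ID <- cumsum(changed)

# Split "Team1 - Team2" into two columns
team_mat <- do.call(rbind, strsplit(teams, " - ", fixed = TRUE))
toy <- data.frame(ID = ID,
                  team1 = team_mat[, 1],
                  team2 = team_mat[, 2],
                  stringsAsFactors = FALSE)

print(parts)
print(toy)
```

The cumsum() version gives the same run numbering as the loop as long as teams has at least one element, and it avoids growing ID one element at a time.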
