Mark Einhorn

Reputation: 25

HTML table does not parse correctly in R

I am trying to read in all the tables from the website "http://www.lassen.co.nz/s14tab.php#hrh". My code to do this looks as follows:

library(XML)
library(RCurl)
url<-"http://www.lassen.co.nz/s14tab.php#hrh"
data<-getURL(url)
data<-htmlParse(data)
tables<-readHTMLTable(data)

The table showing "Team Ranking Points" does not appear to parse correctly and comes back as NULL. I have tried the scrapeR package as well, with the same result. Any help would be greatly appreciated.

Upvotes: 1

Views: 100

Answers (1)

hrbrmstr

Reputation: 78792

The idiot (I'm not usually that harsh but that page belongs on myspace or geocities & would be a great "prosecution Exhibit A" for the need to have a license to put HTML on the internet) who made that page decided that [s]he could "make up" new rules for commenting out parts of HTML.

This gem:

<TABLE border=0 cellspacing=2 cellpadding=3 /*style="border: 1px solid #000;"*/>

appears twice. While libxml2 (the C library behind the xml2 package) is good at handling horrible HTML, this throws it for a bit of a loop. So, we have to deal with the creative commenting first:

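library(rvest)

# Read the raw page source, one element per line
pg <- readLines("http://www.lassen.co.nz/s14tab.php")

# Strip the made-up /* ... */ "comment" markers inside the <TABLE> tags.
# fixed = TRUE matches the patterns literally, so "*" is not treated as a regex quantifier.
pg <- gsub("/*style", "style", pg, fixed = TRUE)
pg <- gsub("*/>", ">", pg, fixed = TRUE)

# Re-join the lines into one string and parse the repaired HTML
pg <- read_html(paste0(pg, collapse = ""))

# Grab every table that immediately follows an <h2>
html_table(html_nodes(pg, "h2 + table"), fill = TRUE)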
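The asterisks in those two patterns are regex metacharacters, so they have to be matched literally for the substitution to do anything. Here is a minimal offline sketch on the tag string itself (no network needed; the variable names are only for illustration) comparing a plain regex pattern with fixed = TRUE:

```r
# The malformed tag, copied from the page source
tag <- '<TABLE border=0 cellspacing=2 cellpadding=3 /*style="border: 1px solid #000;"*/>'

# As a regex, "/*style" means "zero or more slashes, then style",
# so it only matches the word "style" and the "/*" stays in place
as_regex <- gsub("/*style", "style", tag)

# With fixed = TRUE both patterns are treated as literal strings
step1    <- gsub("/*style", "style", tag, fixed = TRUE)
repaired <- gsub("*/>", ">", step1, fixed = TRUE)

print(identical(as_regex, tag))  # TRUE: the regex version changed nothing
print(repaired)
```

Escaping the asterisks ("/\\*style" and "\\*/>") should work just as well; fixed = TRUE is simply harder to get wrong.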

The same person who can't follow proper HTML coding guidelines seems also to have never heard of the <div> tag, so you'll have to do some cleanup of tables 2 & 3.

If they ever change the formatting (unlikely given the ancient processes this thing is built on), the "h2 + table" CSS selector will need to be updated to better target those three tables.
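For anyone unsure what that selector does: "h2 + table" is the CSS adjacent-sibling combinator, so it only picks up a table whose immediately preceding sibling is an h2. A small sketch on a made-up HTML snippet (the markup below is an assumption for illustration, not the real page):

```r
library(rvest)

# Toy document: the first table directly follows an <h2>, the second follows a <p>
doc <- read_html(paste0(
  "<html><body>",
  "<h2>Heading</h2><table><tr><td>first</td></tr></table>",
  "<p>filler</p><table><tr><td>second</td></tr></table>",
  "</body></html>"
))

# Only tables immediately preceded by an <h2> sibling are selected
hits <- html_nodes(doc, "h2 + table")
print(length(hits))

# For comparison, a bare "table" selector picks up both
all_tables <- html_nodes(doc, "table")
print(length(all_tables))
```

If the page layout changes so that something sits between a heading and its table, the selector stops matching, which is when it would need adjusting.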

Upvotes: 1
