Reputation: 25
I am trying to read in all the tables from the website "http://www.lassen.co.nz/s14tab.php#hrh". My code to do this looks as follows:
library(XML)
library(RCurl)
url<-"http://www.lassen.co.nz/s14tab.php#hrh"
data<-getURL(url)
data<-htmlParse(data)
tables<-readHTMLTable(data)
The table indicating "Team Ranking Points" appears to not parse correctly and therefore is shown as NULL. I have tried using the scrapeR package but had the same result. Any help would be greatly appreciated.
Upvotes: 1
Views: 100
Reputation: 78792
The idiot (I'm not usually that harsh, but that page belongs on MySpace or GeoCities and would be a great "prosecution Exhibit A" for the need to have a license to put HTML on the internet) who made that page decided that they could "make up" new rules for commenting out parts of HTML.
This gem:
<TABLE border=0 cellspacing=2 cellpadding=3 /*style="border: 1px solid #000;"*/>
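appears twice. The /* ... */ pair is CSS and JavaScript comment syntax and has no meaning inside an HTML tag. While the libxml2 C library (which the xml2 and rvest packages sit on top of) is good at handling horrible HTML, this throws it for a bit of a loop. So, we have to deal with the creative commenting first: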
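library(rvest)
# Read the raw page source, one element per line
pg <- readLines("http://www.lassen.co.nz/s14tab.php")
# Strip the bogus /* ... */ markers; fixed = TRUE matches them literally, not as regex
pg <- gsub("/*style", "style", pg, fixed = TRUE)
pg <- gsub("*/>", ">", pg, fixed = TRUE)
# Rejoin the lines into a single string and parse it
pg <- read_html(paste0(pg, collapse = ""))
# Keep only the tables that directly follow an <h2>
html_table(html_nodes(pg, "h2 + table"), fill = TRUE)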
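One detail worth checking in the gsub() step: "/*" and "*/" contain a regex metacharacter, so the patterns have to be matched literally for the substitution to take effect. A minimal sketch, run on the tag quoted above rather than on the live page (the variable names are only for illustration):

```r
# The malformed tag quoted above, used as a stand-in for the page source.
tag <- '<TABLE border=0 cellspacing=2 cellpadding=3 /*style="border: 1px solid #000;"*/>'

# Read as a regex, "/*style" means "zero or more slashes, then 'style'",
# so the asterisk in the input is never consumed and nothing changes.
as_regex <- gsub("/*style", "style", tag)

# With fixed = TRUE each pattern is matched character for character.
cleaned <- gsub("/*style", "style", tag, fixed = TRUE)
cleaned <- gsub("*/>", ">", cleaned, fixed = TRUE)

print(identical(as_regex, tag))  # did the regex version leave the tag untouched?
print(cleaned)
```

Escaping the asterisk ("/\\*style", "\\*/>") should work just as well; fixed = TRUE is simply easier to read.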
The same person who can't follow proper HTML coding guidelines also seems never to have heard of the <div> tag, so you'll have to do some cleanup of tables 2 & 3.
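What that cleanup looks like depends on what actually comes back for those two tables, which isn't shown here. As a rough starting point, and assuming the mess is mostly padding rows and columns that are blank or NA, a hypothetical helper (drop_empty is not part of rvest, and the sample data is made up) could look like this:

```r
# Hypothetical helper: drop rows and columns that are entirely NA or blank.
drop_empty <- function(df) {
  is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""
  # One logical column per data-frame column, TRUE where the cell is empty.
  blank <- vapply(df, is_blank, logical(nrow(df)))
  blank <- matrix(blank, nrow = nrow(df), ncol = ncol(df))
  df[rowSums(!blank) > 0, colSums(!blank) > 0, drop = FALSE]
}

# Made-up example data, standing in for one of the scraped tables.
messy <- data.frame(
  Team = c("A", "", "B"),
  Pad  = c(NA, NA, NA),
  Pts  = c("10", "", "7"),
  stringsAsFactors = FALSE
)

tidy <- drop_empty(messy)
print(tidy)
```

If the real tables have merged cells or headers buried in data rows, they will need more targeted handling than this.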
If they ever change the formatting (unlikely, given the ancient processes this thing is built on), the h2 + table selector will need to be updated to better target those three tables.
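For reference, "h2 + table" is the CSS adjacent-sibling combinator: it matches a table only when it comes immediately after an h2 at the same level. A small sketch on a made-up page (the HTML below is invented for illustration and is not the real site's markup):

```r
library(rvest)

# Made-up miniature page: one table right after an <h2>, one after a <p>.
doc <- read_html(paste0(
  "<html><body>",
  "<h2>Standings</h2><table><tr><td>kept</td></tr></table>",
  "<p>Notes</p><table><tr><td>skipped</td></tr></table>",
  "</body></html>"
))

# Only the table that directly follows the <h2> is selected.
hits <- html_nodes(doc, "h2 + table")
print(length(hits))
print(html_text(hits))
```

If the headings ever move or gain wrappers, a selector based on table position or on an id or class (should the page ever get any) would be the thing to switch to.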
Upvotes: 1