digitalmaps

Reputation: 2905

Handling HTML web-scraping errors in R with XML package

I am trying to scrape a web page like this one: http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html. Using the following code, I receive an error suggesting that the HTML is malformed:

library(RCurl)
library(XML)
weather <- getURL("http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html")
doc <- htmlParse(weather)

I have seen this post, which demonstrates how to use Internet Explorer and the rcom package to repair improperly formed HTML before feeding it to the parser. However, the HTML in question passes validation at http://validator.w3.org.

What other ways are there for handling an HTML parse-related error like this one with the XML package?

Upvotes: 0

Views: 672

Answers (1)

Tyler Rinker

Reputation: 110062

Give this a whirl and see if it does what you're after:

library(RCurl)
library(XML)
url   <- "http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html"
doc   <- htmlTreeParse(url, useInternalNodes=TRUE)
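The reason this works is that `htmlTreeParse()` (like `htmlParse()`) hands the document to libxml2's forgiving HTML parser, which repairs most malformed markup on the fly. If a parse error still surfaces, one option is to trap it with base R's `tryCatch()` so your script degrades gracefully instead of stopping. A minimal sketch, using an invented malformed snippet rather than the live weather page:

```r
library(XML)

# A deliberately malformed snippet for illustration: unclosed <p> and <div>.
bad_html <- "<html><body><p>Temp: 21 C<div></body>"

# htmlParse() is lenient by default; wrap it in tryCatch() anyway so a
# truly unparseable input yields NULL instead of aborting the script.
doc <- tryCatch(
  htmlParse(bad_html, asText = TRUE),
  error = function(e) {
    message("HTML parse failed: ", conditionMessage(e))
    NULL
  }
)

if (!is.null(doc)) {
  # XPath queries work even though the original markup was malformed.
  print(xpathSApply(doc, "//p", xmlValue))
}
```

The same `tryCatch()` wrapper can go around the `getURL()`/`htmlParse()` pair in the question if the live page is the one that misbehaves.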

I also suggest you check out these resources:

  1. talkstats.com thread on web scraping (great beginner examples)
  2. w3schools.com site on html stuff (very helpful)

Upvotes: 2
