Reputation: 2905
I am trying to scrape a web page like this one: http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html. Using the following code, I receive an error suggesting that the HTML is malformed:
library(RCurl)
library(XML)
weather <- getURL("http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html")
doc <- htmlParse(weather)
I have seen this post, which demonstrates how to use Internet Explorer and the rcom package to fix improperly formed HTML and then feed it to the parser. However, the HTML in question passes validation at http://validator.w3.org.
What other ways are there for handling an HTML parse-related error like this one with the XML package?
Upvotes: 0
Views: 672
Reputation: 110062
Give this a whirl and see if it does what you're after:
library(RCurl)
library(XML)
url <- "http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html"
doc <- htmlTreeParse(url, useInternalNodes=TRUE)
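If the page has already been downloaded into a string (as with getURL in the question), something along these lines may also work. This is only a sketch: the HTML string below is a made-up stand-in for the downloaded page, not the real weather page, and passing an empty function as the error handler is just one way to quiet the parser's recoverable complaints.

```r
library(XML)

# Hypothetical, deliberately sloppy HTML standing in for a downloaded page:
# the <p> and <b> tags are never closed.
html <- "<html><body><h1>Forecast</h1><p>Sunny <b>today</body></html>"

# asText = TRUE tells the parser the string is markup, not a file name or URL.
# The empty error handler silences messages about recoverable problems.
doc <- htmlParse(html, asText = TRUE, error = function(...) {})

# Pull values out of the repaired tree with XPath.
heading <- xpathSApply(doc, "//h1", xmlValue)
bold <- xpathSApply(doc, "//b", xmlValue)
```

With real data you would replace the html string with the result of getURL(url) and adjust the XPath expressions to the page's actual structure.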
I also suggest you check out these resources:
Upvotes: 2