digitalmaps

Reputation: 2905

Handling HTML web-scraping errors in R with XML package

I am trying to scrape a web page like this one: http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html. Using the following code, I receive an error suggesting that the HTML is malformed:

library(RCurl)
library(XML)
weather <- getURL("http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html")
doc <- htmlParse(weather)

I have seen this post, which demonstrates how to use Internet Explorer and the rcom package to repair improperly formed HTML before feeding it to the parser. However, the HTML in question passes validation at http://validator.w3.org.

What other ways are there for handling an HTML parse-related error like this one with the XML package?

Upvotes: 0

Views: 672

Answers (1)

Tyler Rinker

Reputation: 110062

Give this a whirl and see if it does what you're after:

library(RCurl)
library(XML)
url   <- "http://www.weatheroffice.gc.ca/city/pages/on-135_metric_e.html"
doc   <- htmlTreeParse(url, useInternalNodes=TRUE)
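The reason this works is that `htmlTreeParse()` (like `htmlParse()`) hands the document to libxml2's forgiving HTML parser, which repairs most malformed markup on the fly. If a parse error still surfaces, one option is to trap it with base R's `tryCatch()` so your script degrades gracefully instead of stopping. A minimal sketch, using an invented malformed snippet rather than the live weather page:

```r
library(XML)

# A deliberately malformed snippet for illustration: unclosed <p> and <div>.
bad_html <- "<html><body><p>Temp: 21 C<div></body>"

# htmlParse() is lenient by default; wrap it in tryCatch() anyway so a
# truly unparseable input yields NULL instead of aborting the script.
doc <- tryCatch(
  htmlParse(bad_html, asText = TRUE),
  error = function(e) {
    message("HTML parse failed: ", conditionMessage(e))
    NULL
  }
)

if (!is.null(doc)) {
  # XPath queries work even though the original markup was malformed.
  print(xpathSApply(doc, "//p", xmlValue))
}
```

The same `tryCatch()` wrapper can go around the `getURL()`/`htmlParse()` pair in the question if the live page is the one that misbehaves.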

I also suggest you check out these resources:

  1. talkstats.com thread on web scraping (great beginner examples)
  2. w3schools.com site on html stuff (very helpful)

Upvotes: 2
