tomw

Reputation: 3160

Using readHTMLTable from the XML package to scrape a site: unclear error message

I'm using the XML package to scrape a list of web pages. Specifically, I'm collecting ratings for a list of candidates from the following site: votesmart.

The candidates' pages are arranged in numerical order, from 1 upwards. My first attempt, to scrape the first 50 candidates, looks like this:

library(XML)
library(plyr)

url <- paste("http://www.votesmart.org/candidate/evaluations/", 1:50, sep = "")
res <- llply(url, function(i) readHTMLTable(i))

But there are a couple of problems. For instance, the 25th page in this sequence generates a 404 "URL not found" error. I've addressed this by first building a data frame with the count of XML errors for each page in the sequence, and then excluding the pages that have a single error. Specifically:

errors <- ldply(url, function(i) length(getXMLErrors(i)))
url2 <- url[which(errors$V1 > 1)]
res2 <- llply(url2, function(i) readHTMLTable(i))

In this way, I've excluded the 404-generating URLs from the list.

However, there is still a problem: numerous pages in the list cause the llply command to fail. The following is an example:

readHTMLTable("http://www.votesmart.org/candidate/evaluations/6")

which results in the error:

Error in seq.default(length = max(numEls)) : 
  length must be non-negative number
In addition: Warning message:
In max(numEls) : no non-missing arguments to max; returning -Inf

However, these pages generate the same error count from the getXMLErrors command as the working pages, so I'm unable to distinguish between them on this front.

My question is: what does this error mean, and is there any way to get readHTMLTable to return an empty list for these pages, rather than an error? Failing that, is there a way I can get my llply statement to check these pages and skip those that result in an error?

Upvotes: 1

Views: 931

Answers (1)

joran

Reputation: 173627

Why not just some simple error handling?

res <- llply(url, function(i) try(readHTMLTable(i)))
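If an empty list is preferred over a try-error object, a minimal sketch along the same lines could use tryCatch instead. The names safe_read and fake_reader below are hypothetical helpers made up for illustration, not part of XML or plyr. fake_reader is only a stand-in for readHTMLTable so that the example runs without network access, and it is assumed to fail on one page the way the problematic pages do.

```r
# Hypothetical helper: call `reader` on `u`, returning an empty list on error.
safe_read <- function(u, reader) {
  tryCatch(reader(u), error = function(e) list())
}

# Stand-in for readHTMLTable so the sketch runs offline (an assumption for
# illustration). It fails on one URL the way a problematic page would.
fake_reader <- function(u) {
  if (grepl("/6$", u)) stop("length must be non-negative number")
  list(data.frame(url = u))
}

urls <- paste("http://www.votesmart.org/candidate/evaluations/", 5:7, sep = "")
res <- lapply(urls, safe_read, reader = fake_reader)

# Keep only the pages that produced at least one table.
ok <- vapply(res, length, integer(1)) > 0
res_ok <- res[ok]
```

With the real packages loaded, the equivalent call would presumably be llply(url, safe_read, reader = readHTMLTable). If plain try() is used instead, the failed elements can be picked out afterwards with inherits(x, "try-error").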

Upvotes: 3
