Reputation: 3160
I'm using the XML
package to scrape a list of websites. Specifically, I'm taking ratings for a list of candidates at the following site: votesmart.
The candidates' pages are arranged in numerical order, from 1 upwards. My first attempt, to scrape the first 50 candidates, looks like this:
library(XML)
library(plyr)
# Build the URLs for the first 50 candidate pages
url <- paste("http://www.votesmart.org/candidate/evaluations/", 1:50, sep = "")
# Read the HTML tables from each page
res <- llply(url, function(i) readHTMLTable(i))
But there are a couple of problems. For instance, the 25th page in the sequence generates a 404 "url not found"
error. I've addressed this by first getting a data frame of the count of XML
errors for each page in the sequence, and then excluding the pages which return only a single error. Specifically:
# Count the XML parse errors for each page
errors <- ldply(url, function(i) length(getXMLErrors(i)))
# Keep only the pages with more than one error (the 404 pages return a single error)
url2 <- url[which(errors$V1 > 1)]
res2 <- llply(url2, function(i) readHTMLTable(i))
In this way, I've excluded the 404 generating URLs from this list.
However, there's still a problem: numerous pages in the list cause the llply command to fail. The following is an example:
readHTMLTable("http://www.votesmart.org/candidate/evaluations/6")
which results in the error
Error in seq.default(length = max(numEls)) :
length must be non-negative number
In addition: Warning message:
In max(numEls) : no non-missing arguments to max; returning -Inf
However, these pages generate the same error count from the getXMLErrors command as the working pages, so I'm unable to distinguish between them on this front.
My question is: what does this error mean, and is there any way to get readHTMLTable to return an empty list for these pages, rather than an error? Failing that, is there a way I can get my llply statement to check these pages and skip those which result in an error?
Upvotes: 1
Views: 931
Reputation: 173627
Why not just some simple error handling?
res <- llply(url, function(i) try(readHTMLTable(i)))
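If the failing pages should simply be dropped afterwards, the results of class "try-error" can be filtered out, and tryCatch offers a way to return an empty list instead of an error, as asked. A minimal sketch under those assumptions (the name res_ok is just illustrative):
res <- llply(url, function(i) try(readHTMLTable(i), silent = TRUE))
# Drop the elements where readHTMLTable failed
res_ok <- res[!sapply(res, inherits, what = "try-error")]
# Alternatively, return an empty list for the failing pages
res <- llply(url, function(i) tryCatch(readHTMLTable(i), error = function(e) list()))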
Upvotes: 3