Def

Reputation: 21

RCurl, error: connection time out

I use the XML and RCurl packages of R to get data from a website. The script needs to scrape 6,000,000 pages, so I created a loop:

library(RCurl)
library(XML)

for (page in c(1:6000000)) {

  my_url <- paste('http://webpage.....')
  page1  <- getURL(my_url, encoding = "UTF-8")
  mydata <- htmlParse(page1, asText = TRUE, encoding = "UTF-8")
  title  <- xpathSApply(mydata, '//head/title', xmlValue, simplify = TRUE, encoding = "UTF-8")

  .....
  .....
  .....
}

However, after a few loops I get the error message:

Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : connection time out

The problem is that I don't understand how the time-out works. Sometimes the process stops after 700 pages, other times after 1000 or 1200; the point at which it fails is not consistent. When the connection times out, I cannot access the website from my laptop for about 15 minutes. I thought of using a command to delay the process for 15 minutes every 1000 pages scraped:

if (page %% 1000 == 0) Sys.sleep(901)

but nothing changed.

Any ideas what is going wrong and how to overcome this?

Upvotes: 1

Views: 2741

Answers (2)

Chernoff

Reputation: 2622

You could call a native installation of curl from R with system(). That gives you access to curl options not currently supported by RCurl, such as --retry <num>, which makes a failed request be retried with increasing delays between attempts: 1 second after the first failure, 2 seconds after the second, 4 seconds after the third, and so on. Other timing options are documented in the cURL manual at http://curl.haxx.se/docs/manpage.html.
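A minimal sketch of what that call might look like, assuming the curl binary is on your PATH; the URL, retry count, and output file here are placeholders, not part of the original answer:

  library(XML)

  my_url   <- "http://webpage....."          # placeholder URL from the question
  out_file <- tempfile(fileext = ".html")    # temporary file to hold the response

  # --retry 5 retries a failed request up to 5 times, doubling the wait between
  # attempts; --max-time 60 caps each attempt at 60 seconds; -s silences progress.
  status <- system(paste("curl -s --retry 5 --max-time 60 -o",
                         shQuote(out_file), shQuote(my_url)))

  if (status == 0) {
    mydata <- htmlParse(out_file, encoding = "UTF-8")
    title  <- xpathSApply(mydata, "//head/title", xmlValue)
  }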

Upvotes: 2

Def

Reputation: 21

I solved it. I just added Sys.sleep(1) to each iteration.
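For reference, a minimal sketch of the loop from the question with that pause added; the URL and parsing steps are the placeholders from the question:

  library(RCurl)
  library(XML)

  for (page in 1:6000000) {
    my_url <- paste('http://webpage.....')
    page1  <- getURL(my_url, encoding = "UTF-8")
    mydata <- htmlParse(page1, asText = TRUE, encoding = "UTF-8")
    title  <- xpathSApply(mydata, '//head/title', xmlValue)
    Sys.sleep(1)   # wait one second between requests so the server does not drop the connection
  }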

Upvotes: 1
