Reputation: 21
I use the XML and RCurl packages in R to get data from a website. The script needs to scrape 6,000,000 pages, so I created a loop:
library(RCurl)
library(XML)

for (page in 1:6000000) {
  my_url <- paste('http://webpage.....')                     # build the URL for this page
  page1  <- getURL(my_url, encoding = "UTF-8")               # download the page
  mydata <- htmlParse(page1, asText = TRUE, encoding = "UTF-8")
  title  <- xpathSApply(mydata, '//head/title', xmlValue, simplify = TRUE, encoding = "UTF-8")
  .....
  .....
  .....
}
However, after a few loops I get the error message:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : connection time out
The problem is that I don't understand how the "time out" works. Sometimes the process stops after 700 pages, other times after 1000, 1200, etc.; the point of failure is not stable. When the connection times out, I can't access the website from my laptop for 15 minutes. I thought of using a command to delay the process for 15 minutes after every 1000 pages scraped:
if (page %% 1000 == 0) Sys.sleep(901)
but nothing changed.
Any ideas what is going wrong and how to overcome this?
Upvotes: 1
Views: 2741
Reputation: 2622
You could make a call from R to a native installation of curl using the command system(). This way you get access to all the curl options not currently supported by RCurl, such as --retry <num>. The --retry <num> option makes a failed curl request try again, waiting ever longer between attempts: 1 second after the first failure, 2 seconds after the second, 4 seconds after the third, and so on. Other time-control options are also documented on the cURL site: http://curl.haxx.se/docs/manpage.html.
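A minimal sketch of that idea, assuming a native curl binary is on the PATH. The URL is the placeholder from the question, and fetch_page is a hypothetical helper, not part of RCurl:

library(XML)

# Hypothetical helper: download one page using curl's own retry logic,
# then hand the result to the XML package for parsing.
fetch_page <- function(url, retries = 5) {
  tmp <- tempfile(fileext = ".html")
  # --retry waits 1 second after the first failure and doubles the wait
  # after each further failure; --silent suppresses the progress meter.
  status <- system(paste("curl --silent --retry", retries,
                         "--output", shQuote(tmp), shQuote(url)))
  if (status != 0) return(NULL)   # give up on this page after all retries
  htmlParse(tmp, encoding = "UTF-8")
}

mydata <- fetch_page('http://webpage.....')

Options such as --retry-delay and --max-time, described on the same manual page, give finer control over how long each attempt waits and how long it may run.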
Upvotes: 2