antecessor

Reputation: 2800

Recommendation when using Sys.sleep() in R with rvest

I am scraping thousands of webpages using the R package rvest. To avoid overloading the server, I call Sys.sleep() with 5 seconds between requests.

It works fine until roughly 400 webpages have been scraped. Beyond that point, however, I get nothing: all the data come back empty, although no error is thrown.

I am wondering whether there is any way to modify this so that I scrape 350 webpages at 5 seconds each, then wait for instance 5 minutes, then continue with another 350 webpages... and so on.

I checked the Sys.sleep() documentation, and only time appears as an argument. So, if this cannot be done with that function, is there any other possibility or function for dealing with this problem when scraping a huge batch of pages?

UPDATE WITH AN EXAMPLE

This is part of my code. The object links contains more than 8,000 links.

library(rvest)

# pre-allocate one result slot per link
title <- vector("character", length = length(links))
short_description <- vector("character", length = length(links))

for(i in seq_along(links)){
  Sys.sleep(5)                      # wait 5 seconds before each request
  aff_link <- read_html(links[i])
  title[i] <- aff_link %>%
    html_nodes("title") %>%
    html_text()
  short_description[i] <- aff_link %>%
    html_nodes(".clp-lead__headline") %>%
    html_text()
}

Upvotes: 0

Views: 3850

Answers (1)

Spacedman

Reputation: 94317

You could add a check on the modulus of a loop variable and do an extra sleep every N iterations. Example:

for(i in 1:100){
  message("Getting page ", i)
  Sys.sleep(5)
  if((i %% 10) == 0){
    message("taking a break")
    Sys.sleep(10)
  }
}

Every 10 iterations the expression (i %% 10) == 0 is TRUE and you get an extra 10 seconds of sleep.

I can think of more complex solutions but this might work for you.
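
Applied to your loop, with the batch size and pause from your question (350 pages, then 5 minutes), it might look like this (the extraction code is elided):

for(i in seq_along(links)){
  Sys.sleep(5)                 # base 5-second delay per page
  aff_link <- read_html(links[i])
  # ... extract title[i] and short_description[i] as before ...
  if((i %% 350) == 0){
    message("taking a 5 minute break after page ", i)
    Sys.sleep(300)             # 5-minute pause every 350 pages
  }
}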

One other possibility is to check if a page returns any data, and if not, sleep twice as long and try again, repeating this a number of times. Here's some semi-pseudocode:

get_page = function(page){
  sleep = 5
  for(try in 1:5){
    # get_content() and download_okay() stand in for your own
    # fetching and validation logic
    html = get_content(page)
    if(download_okay(html)){
      return(html)
    }
    sleep = sleep * 2   # double the wait before trying again
    Sys.sleep(sleep)
  }
  return("I tried - but I failed!")
}
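
As a concrete illustration, here is a minimal runnable sketch of that idea with rvest, where "download okay" is approximated by read_html() not throwing an error. That success test is an assumption: since your pages come back empty without an error, you may need a content check (e.g. a non-empty title) instead.

library(rvest)

get_page <- function(url){
  sleep <- 5
  for(try in 1:5){
    # treat any error (timeout, connection refused, ...) as a failure
    html <- tryCatch(read_html(url), error = function(e) NULL)
    if(!is.null(html)){
      return(html)
    }
    sleep <- sleep * 2    # exponential backoff: 10, 20, 40, ... seconds
    Sys.sleep(sleep)
  }
  NULL                    # give up after 5 tries
}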

Some web page getters, like curl, will do this retrying automatically given the right options - there may be a way to wangle that into your code too.
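
In R, for instance, httr::RETRY() does this kind of retry-with-growing-pauses for you. A sketch, using the first of your URLs:

library(httr)
library(rvest)

# retry up to 5 times, with pauses that grow from a 5-second base
resp <- RETRY("GET", links[1], times = 5, pause_base = 5)
page <- read_html(content(resp, as = "text"))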

Upvotes: 1
