antecessor

Reputation: 2800

Recommendation when using Sys.sleep() in R with rvest

I am scraping thousands of webpages using the R package rvest. To avoid overloading the server, I call Sys.sleep() with 5 seconds between requests.

It works fine until roughly 400 webpages have been scraped. Beyond that point, however, I get nothing: all the data come back empty, although no error is thrown.

I am wondering whether there is any way to modify this so that I scrape 350 webpages at 5 seconds each, then wait for instance 5 minutes, then continue with another 350 webpages... and so on.

I checked the Sys.sleep() documentation, and only time appears as an argument. So, if this cannot be done with that function, is there any other possibility or function for dealing with this problem when scraping a huge batch of pages?

UPDATE WITH AN EXAMPLE

This is part of my code. The object links contains more than 8,000 links.

library(rvest)

# pre-allocate one result slot per link
title <- vector("character", length = length(links))
short_description <- vector("character", length = length(links))

for(i in seq_along(links)){
  Sys.sleep(5)                      # wait 5 seconds before each request
  aff_link <- read_html(links[i])
  title[i] <- aff_link %>%
    html_nodes("title") %>%
    html_text()
  short_description[i] <- aff_link %>%
    html_nodes(".clp-lead__headline") %>%
    html_text()
}

Upvotes: 0

Views: 3850

Answers (1)

Spacedman

Reputation: 94317

You could add a check on the modulus of a loop variable and do an extra sleep every N iterations. Example:

for(i in 1:100){
  message("Getting page ", i)
  Sys.sleep(5)
  if((i %% 10) == 0){
    message("taking a break")
    Sys.sleep(10)
  }
}

Every 10 iterations the expression (i %% 10) == 0 is TRUE and you get an extra 10 seconds of sleep.

I can think of more complex solutions but this might work for you.
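
Applied to your loop, with the batch size and pause from your question (350 pages, then 5 minutes), it might look like this (the extraction code is elided):

for(i in seq_along(links)){
  Sys.sleep(5)                 # base 5-second delay per page
  aff_link <- read_html(links[i])
  # ... extract title[i] and short_description[i] as before ...
  if((i %% 350) == 0){
    message("taking a 5 minute break after page ", i)
    Sys.sleep(300)             # 5-minute pause every 350 pages
  }
}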

One other possibility is to check if a page returns any data, and if not, sleep twice as long and try again, repeating this a number of times. Here's some semi-pseudocode:

get_page = function(page){
  sleep = 5
  for(try in 1:5){
    # get_content() and download_okay() stand in for your own
    # fetching and validation logic
    html = get_content(page)
    if(download_okay(html)){
      return(html)
    }
    sleep = sleep * 2   # double the wait before trying again
    Sys.sleep(sleep)
  }
  return("I tried - but I failed!")
}
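
As a concrete illustration, here is a minimal runnable sketch of that idea with rvest, where "download okay" is approximated by read_html() not throwing an error. That success test is an assumption: since your pages come back empty without an error, you may need a content check (e.g. a non-empty title) instead.

library(rvest)

get_page <- function(url){
  sleep <- 5
  for(try in 1:5){
    # treat any error (timeout, connection refused, ...) as a failure
    html <- tryCatch(read_html(url), error = function(e) NULL)
    if(!is.null(html)){
      return(html)
    }
    sleep <- sleep * 2    # exponential backoff: 10, 20, 40, ... seconds
    Sys.sleep(sleep)
  }
  NULL                    # give up after 5 tries
}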

Some web page getters, like curl, will do this retrying automatically given the right options - there may be a way to wangle that into your code too.
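
In R, for instance, httr::RETRY() does this kind of retry-with-growing-pauses for you. A sketch, using the first of your URLs:

library(httr)
library(rvest)

# retry up to 5 times, with pauses that grow from a 5-second base
resp <- RETRY("GET", links[1], times = 5, pause_base = 5)
page <- read_html(content(resp, as = "text"))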

Upvotes: 1
