Anna Jones

Reputation: 111

Using Sys.sleep in rvest

I'm trying to scrape some data from websites with rvest. I have a tibble of thousands of URLs, and I need to extract one piece of data from each URL. To avoid being blocked by the main site I'm visiting, I need to pause for about 2 minutes after every 200 URLs I visit (learned this via trial and error). I'm wondering how I can use Sys.sleep() to do this.

My current code is below. I am going to each URL in url_tibble and pulling data (".verified").

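library(rvest)  # read_html(), html_nodes(), html_attr(), %>%
library(dplyr)  # mutate()
library(purrr)  # map()

# Function to extract the href of every ".verified" element on a page
get_data <- function(x) {
  read_html(x) %>%
    html_nodes(".verified") %>%
    html_attr("href")
}

# Extract data: one list element per URL
data_I_need <- url_tibble %>%
  mutate(profile = map(url, ~ get_data(.x)))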

This code works for a limited number of URLs, until I get blocked for scraping the site too quickly. To avoid being blocked, I'd like to pause for 2 minutes after every 200 URLs using Sys.sleep(). Can you help me figure out how to do this?

The best recommendation I found was the solution posted on "Recommendation when using Sys.sleep() in R with rvest", but I can't figure out how to integrate it with my code, because that solution uses a loop instead of map. I tried something like this:

output <- vector(length = length(url_tibble$url))

for (i in 1:length(url_tibble$url)) {
  data_I_need <- read_html(url_tibble$url[i]) %>%
    html_nodes(".verified") %>%
    html_attr("href")
  output[i] <- data_I_need
  if ((i %% 200) == 0) {
    Sys.sleep(160)
  }
}

However, this does not work either, and I receive an error message.

Upvotes: 2

Views: 562

Answers (1)

NelsonGon

Reputation: 13319

We can use lapply in lieu of a for loop. I have also prepended https:// to each URL so that read_html recognises them as links rather than files. For the actual data, replace 2 with 200 and lengthen the pause to about two minutes (for example, Sys.sleep(120)).

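lapply(seq_along(url_tibble$url), function(x) {
  # Pause on every 2nd URL (use 200 for the real data)
  if (x %% 2 == 0) {
    print(paste0("Sleeping at ", x))
    Sys.sleep(20)
  }
  # Prepend the scheme so read_html() treats the string as a URL, not a file path
  read_html(paste0("https://", url_tibble$url[x])) %>%
    html_nodes(".verified") %>%
    html_attr("href")
})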

Output (truncated)

[1] "Sleeping at 2"
[1] "Sleeping at 4"
[1] "Sleeping at 6"
[[1]]
 [1] "https://www.psychologytoday.com/us/therapists/aak-bright-start-rego-park-ny/936718"                   
 [2] "https://www.psychologytoday.com/us/therapists/leslie-aaron-new-york-ny/148793"                        
 [3] "https://www.psychologytoday.com/us/therapists/lindsay-aaron-frieman-new-york-ny/761657"               
 [4] "https://www.psychologytoday.com/us/therapists/fay-m-aaronson-brooklyn-ny/840861"                      
 [5] "https://www.psychologytoday.com/us/therapists/anita-aasen-staten-island-ny/291614"                    
 [6] "https://www.psychologytoday.com/us/therapists/aask-therapeutic-services-fishkill-ny/185423"           
 [7] "https://www.psychologytoday.com/us/therapists/amanda-abady-brooklyn-ny/935849"                        
 [8] "https://www.psychologytoday.com/us/therapists/denise-abatemarco-new-york-ny/143678"                   
 [9] "https://www.psychologytoday.com/us/therapists/raya-abat-robinson-new-york-ny/810730"
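
If you would rather keep the map() pipeline from the question, one possible approach is to wrap the extraction function so that it counts its own calls and pauses between batches. The sketch below is only an illustration: with_pause, fake_get and slow_fake are hypothetical names, not part of rvest or purrr, and the demo uses a stand-in function with a zero-second pause so that it runs offline.

```r
# Hypothetical helper (not part of rvest or purrr): wraps a function `f` so
# that it sleeps for `pause` seconds after every `every` calls.
with_pause <- function(f, every = 200, pause = 120) {
  calls <- 0  # number of calls made so far, kept in the closure
  function(...) {
    # Sleep before starting a new batch, i.e. before calls 201, 401, ...
    if (calls > 0 && calls %% every == 0) {
      message("Pausing after ", calls, " requests")
      Sys.sleep(pause)
    }
    calls <<- calls + 1
    f(...)
  }
}

# Offline demo: a stand-in for get_data(), batches of 2, no real waiting
fake_get <- function(x) toupper(x)
slow_fake <- with_pause(fake_get, every = 2, pause = 0)
demo <- lapply(c("a", "b", "c", "d", "e"), slow_fake)
```

With the objects from the question, this would presumably be used as url_tibble %>% mutate(profile = map(url, with_pause(get_data))), assuming the URLs already include the https:// prefix. Because the counter lives inside the wrapper, calling with_pause(get_data) again starts a fresh count.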

Upvotes: 1
