Kaitlin

Reputation: 59

What could be the reason for a webscraping timeout error?

I am trying to scrape legislative data with 'rvest' in R. The data spans several legislative sessions, and the code I built ran smoothly until I reached the more recent sessions. Here is the code I have been using:

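library(rvest)

# Results table: one row per bill, filled in by the loop below
summary2 <- data.frame(matrix(nrow=2, ncol=4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
# Zero-padded bill numbers (e.g. "0002"), as they appear in the site's URLs
k <- sprintf('%0.4d', 2:9046)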


for (i in k) {
  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2016"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub("    ", "", billsum_text)
  
  link <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2016"))
  type <- html_nodes(link, 'h3')
  type_text <- html_text(type)
  
  
  table <- html_node(link, "table.table.table-bordered tbody")
  
  table_text <- html_text(table)
  
  table_text <- gsub("\n", "", table_text)
  table_text <- gsub("\t", "", table_text)
  table_text <- gsub("    ", "", table_text)
  
  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}


summary2$year <- 2016

write.csv(summary2,'2016-bills_cong_17.csv')

The loop iterates over individual pieces of legislation. At seemingly random points the code stops working and I get the following error (or some variation of it):

Error in open.connection(x, "rb") : 
  Timeout was reached: [www.hcdn.gob.ar] Operation timed out after 1875835 milliseconds with 0 out of 0 bytes received

It doesn't seem to be systematically stopping at a particular point or piece of legislation in the loop. I'm assuming this has something to do with the connection being lost to the website, although I've run this code on a computer with an Ethernet connection as well. I would really appreciate any help in debugging this and getting this code to work!

Upvotes: 0

Views: 787

Answers (1)

Bensstats

Reputation: 1056

It's likely that you are hitting the website's servers too fast and getting blocked as a result. Try adding a Sys.sleep() between iterations and between requests.
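
As a rough sketch of that idea, the helper below pauses before every request and retries a few times when a request fails with an error such as a timeout. Note that fetch_with_retry, pause, and max_tries are hypothetical names chosen for illustration and are not part of rvest, and the one-second default pause is only a guess at what this server tolerates.

```r
# Hypothetical helper (not part of rvest): call `fetch(url)`, pausing before
# each attempt and retrying if the request throws an error such as a timeout.
fetch_with_retry <- function(url, fetch, pause = 1, max_tries = 3) {
  for (attempt in seq_len(max_tries)) {
    Sys.sleep(pause)  # throttle every request, including the first
    result <- tryCatch(fetch(url), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed for ", url, ": ",
            conditionMessage(result))
  }
  NULL  # give up; the caller decides how to handle a missing page
}

# Possible usage inside the loop from the question (requires rvest):
# webpage <- fetch_with_retry(
#   paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2016"),
#   fetch = rvest::read_html
# )
# if (is.null(webpage)) next  # skip bills that never loaded
```

Since each iteration makes two requests (textoCompleto.jsp and proyectoTP.jsp), both read_html() calls would go through the helper. If the timeouts continue, increasing pause is the first thing to try.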

Upvotes: 1
