Reputation: 59
I am trying to scrape legislative data with 'rvest' in R. I want data from several legislative sessions, and the code I built ran smoothly until I reached the more recent sessions. Here is the code I have been using:
library(rvest)

# One row per bill: bill number, summary text, bill type, and signatories table
summary2 <- data.frame(matrix(nrow = 2, ncol = 4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
# Zero-padded bill numbers "0002" through "9046"
k <- sprintf('%0.4d', 2:9046)
for (i in k) {
  # Full-text page: bill number and summary
  webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2016"))
  billno <- html_nodes(webpage, 'h1')
  billno_text <- html_text(billno)
  billsum <- html_nodes(webpage, '.interno')
  billsum_text <- html_text(billsum)
  billsum_text <- gsub("\n", "", billsum_text)
  billsum_text <- gsub("\t", "", billsum_text)
  billsum_text <- gsub(" ", "", billsum_text)
  # Procedure page: bill type and signatories table
  link <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2016"))
  type <- html_nodes(link, 'h3')
  type_text <- html_text(type)
  table <- html_node(link, "table.table.table-bordered tbody")
  table_text <- html_text(table)
  table_text <- gsub("\n", "", table_text)
  table_text <- gsub("\t", "", table_text)
  table_text <- gsub("", "", table_text)
  # Rows are indexed by the zero-padded bill number
  summary2[i, 1] <- billno_text
  summary2[i, 2] <- billsum_text
  summary2[i, 3] <- type_text
  summary2[i, 4] <- table_text
}
summary2$year <- 2016
write.csv(summary2, '2016-bills_cong_17.csv')
I run this as a loop over individual pieces of legislation. Seemingly at random, the code stops working and I get the following error (or some variation of it):
Error in open.connection(x, "rb") :
Timeout was reached: [www.hcdn.gob.ar] Operation timed out after 1875835 milliseconds with 0 out of 0 bytes received
It doesn't seem to be systematically stopping at a particular point or piece of legislation in the loop. I'm assuming this has something to do with the connection being lost to the website, although I've run this code on a computer with an Ethernet connection as well. I would really appreciate any help in debugging this and getting this code to work!
Upvotes: 0
Views: 787
Reputation: 1056
It's likely that you are hitting the website's servers too fast and are getting blocked as a result. Try adding a Sys.sleep() call between iterations and between requests.
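As a rough sketch of that suggestion, the helper below pauses before every request and retries a few times when a request errors out. The name fetch_politely, the 2-second pause, and the retry count are assumptions made for illustration; they are not part of rvest, and the right delay for this server is unknown.

```r
# Hypothetical helper (not part of rvest): call `fetch(url)`, sleeping before
# each attempt and retrying a limited number of times if the request fails.
fetch_politely <- function(url,
                           fetch = function(u) rvest::read_html(u),
                           pause = 2,       # base wait in seconds (assumed value)
                           max_tries = 3) { # attempts before giving up (assumed value)
  for (attempt in seq_len(max_tries)) {
    # Wait before every request; the wait grows with each retry
    Sys.sleep(pause * attempt)
    result <- tryCatch(fetch(url), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", attempt, " failed for ", url, ": ", conditionMessage(result))
  }
  NULL  # all attempts failed; the caller decides how to handle it
}
```

In the loop from the question, each read_html(paste0(...)) call could be swapped for fetch_politely(paste0(...)), followed by a check such as if (is.null(webpage)) next so that one failed bill does not stop the whole run.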
Upvotes: 1