G_scrape7

Reputation: 1

Faster web scraping with R

I need help speeding up a web-scraping script in R. I have to scrape data from about 23,000 web pages, and it has to finish in under 2 hours (!). I don't know how to improve my script to reach that goal (I'm new to R!). Here's an example of a page: https://"sample"/46351. Every page is identified by a code at the end of the URL, and all the codes are stored in Codes$id. Can anyone give me some advice? Are there any functions that would speed this up? My code is below. Thanks a lot for the help!

library(rvest)
library(data.table)

cr <- c()
pr <- c()
vig <- c()
ges <- c()
tabellafinale <- NULL
tabellafinale <- data.table(ges, cr, pr, vig, stringsAsFactors = FALSE)
imp <- Codes$id
# placeholder base URL, kept exactly as posted
str1 <- "https://"sample"/"
for (p in 1:length(imp)) {
  try(
    {
      str2 <- imp[p]
      str3 <- paste(str1, str2, sep = "")
      page <- read_html(str3)
      carr <- html_text(html_nodes(page, ".span3"))
      prez <- html_text(html_nodes(page, ".carbFormat"))
      viag <- html_text(html_nodes(page, ".span5"))
      gest <- str2
      # strip newlines, carriage returns and tabs
      carr <- gsub("\n", "", carr)
      viag <- gsub("\n", "", viag)
      prez <- gsub("\n", "", prez)
      carr <- gsub("\r", "", carr)
      viag <- gsub("\r", "", viag)
      prez <- gsub("\r", "", prez)
      carr <- gsub("\t", "", carr)
      viag <- gsub("\t", "", viag)
      prez <- gsub("\t", "", prez)
      car <- data.table(carr)
      n <- length(carr)
      carb <- carr[7:n]
      cr <- data.table(carb)
      prezzi <- data.table(prez)
      vigore <- data.table(viag)
      ges <- data.table(gest)
      # one table per page, appended to the running result
      oss <- data.table(ges, cr, prezzi, vigore, stringsAsFactors = FALSE)
      tabellafinale <- rbind(tabellafinale, oss)
    },
    silent = TRUE
  )
  closeAllConnections()
}

Upvotes: 0

Views: 632

Answers (1)

Ronak Shah

Reputation: 388797

You can try:

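library(rvest)

# str1 is the base URL defined in the question
tabellafinale <- do.call(rbind, lapply(Codes$id, function(str2) {
  try({
    str3 <- paste0(str1, str2)
    page <- read_html(str3)
    # extract the text of the three node sets
    carr <- html_text(html_nodes(page, ".span3"))
    prez <- html_text(html_nodes(page, ".carbFormat"))
    viag <- html_text(html_nodes(page, ".span5"))
    # strip newlines, carriage returns and tabs in one pass
    carr <- gsub("[\n\r\t]", "", carr)
    prez <- gsub("[\n\r\t]", "", prez)
    viag <- gsub("[\n\r\t]", "", viag)
    # skip the first six .span3 entries, as in the original loop
    carb <- carr[7:length(carr)]
    # one data frame per page; do.call(rbind, ...) stacks them once at the end
    data.frame(str2, carb, prez, viag, stringsAsFactors = FALSE)
  }, silent = TRUE)
}))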
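
One thing to watch with the try(..., silent = TRUE) wrapper: when a page fails, try() returns an object of class "try-error", and that object is then handed to rbind together with the data frames. A possible variation, sketched here in base R only, is to return NULL for failed pages with tryCatch(), since rbind on data frames skips NULL entries. The function fake_scrape below is a hypothetical stand-in for the read_html()/html_nodes() steps so that the sketch runs without network access; it is not part of rvest.

```r
# Hypothetical stand-in for the scraping steps: returns a small data frame
# for a code, or raises an error to simulate a page that cannot be read.
fake_scrape <- function(code) {
  if (code == 2) stop("page not available")
  data.frame(str2 = code, carb = c("a", "b"), stringsAsFactors = FALSE)
}

# Wrap one page so that a failure yields NULL instead of a "try-error" object.
safe_scrape <- function(code) {
  tryCatch(fake_scrape(code), error = function(e) NULL)
}

codes <- c(1, 2, 3)
pages <- lapply(codes, safe_scrape)

# rbind() on data frames ignores the NULL entries, so only the pages
# that were scraped successfully end up in the result.
result <- do.call(rbind, pages)
print(result)
```

In the real script, the body of fake_scrape would be the read_html() and html_text() calls shown above, and which(sapply(pages, is.null)) would give the positions of the codes that failed, in case they need to be retried.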

You can also replace the do.call(rbind, lapply(...)) combination with map_df from purrr:

library(purrr)

tabellafinale <- map_df(Codes$id, function(str2) {
  # ..... rest of the code
  # ..... as it is
})

Upvotes: 2
