Reputation: 1
I need help speeding up a web scraping job in R. I have to scrape data from about 23,000 web pages, and the whole run has to finish in under 2 hours (!), but I don't know how to improve my script to get there (I'm new to R!). Here is an example page: https://"sample"/46351. Each page is identified by a code at the end of the URL, and all the codes are stored in Codes$id. Can anyone give me some advice? Are there any functions that would speed this up? My code is below. Thanks a lot for the help!
library(rvest)
library(data.table)

tabellafinale <- NULL
imp <- Codes$id
# base URL placeholder, kept exactly as given in the post
str1 <- "https://"sample"/"

for (p in seq_along(imp)) {
  try({
    str2 <- imp[p]
    str3 <- paste0(str1, str2)
    page <- read_html(str3)
    carr <- html_text(html_nodes(page, ".span3"))
    prez <- html_text(html_nodes(page, ".carbFormat"))
    viag <- html_text(html_nodes(page, ".span5"))
    # strip newlines, carriage returns and tabs
    carr <- gsub("[\n\r\t]", "", carr)
    prez <- gsub("[\n\r\t]", "", prez)
    viag <- gsub("[\n\r\t]", "", viag)
    # the first six .span3 entries are skipped
    carb <- carr[7:length(carr)]
    oss <- data.table(ges = str2, cr = carb, pr = prez, vig = viag)
    # growing the table one page at a time
    tabellafinale <- rbind(tabellafinale, oss)
  }, silent = TRUE)
  closeAllConnections()
}
Upvotes: 0
Views: 632
Reputation: 388797
You can try:
library(rvest)
# str1 is the base URL defined in the question
tabellafinale <- do.call(rbind, lapply(Codes$id, function(str2) {
  # a page that fails returns NULL, which rbind() drops
  tryCatch({
    str3 <- paste0(str1, str2)
    page <- read_html(str3)
    carr <- html_text(html_nodes(page, ".span3"))
    prez <- html_text(html_nodes(page, ".carbFormat"))
    viag <- html_text(html_nodes(page, ".span5"))
    # strip newlines, carriage returns and tabs in one pass
    carr <- gsub("[\n\r\t]", "", carr)
    prez <- gsub("[\n\r\t]", "", prez)
    viag <- gsub("[\n\r\t]", "", viag)
    carb <- carr[7:length(carr)]
    data.frame(str2, carb, prez, viag, stringsAsFactors = FALSE)
  }, error = function(e) NULL)
}))
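To see the shape of this pattern without touching the network, here is a minimal sketch. It assumes a made-up scrape_one() function with fake data (not part of the original code) standing in for the read_html() calls, and makes one id fail on purpose. Each call returns a small data frame, a failure becomes NULL, and the rows are bound once at the end instead of growing the table inside the loop.

```r
# Hypothetical stand-in for the real scraper: returns one small data frame
# per id, and fails for id 3 to mimic a page that cannot be read.
scrape_one <- function(id) {
  if (id == 3) stop("simulated download failure")
  data.frame(
    id   = id,
    carb = c("fuel_a", "fuel_b"),
    prez = c(1.5, 1.7),
    stringsAsFactors = FALSE
  )
}

ids <- 1:5  # stands in for Codes$id

# Build every per-page table first; a failed page becomes NULL.
pieces <- lapply(ids, function(id) {
  tryCatch(scrape_one(id), error = function(e) NULL)
})

# Bind once at the end; rbind() skips the NULL entries.
result <- do.call(rbind, pieces)

nrow(result)       # 8 rows: 4 successful ids x 2 rows each
unique(result$id)  # 1 2 4 5
```

With real pages most of the time is probably still spent waiting on the downloads, so this mainly removes the cost of repeatedly copying a growing table.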
You can replace the do.call(rbind, ...) + lapply combination with map_df from purrr.
tabellafinale <- map_df(Codes$id, function(str2) {
  # ..... rest of the code
  # ..... as it is
})
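A small, network-free sketch of the map_df variant, again with a hypothetical scrape_one() and fake data in place of the real scraping body. It also uses purrr's possibly(), which is an addition not in the code above, so that a failing page yields NULL instead of stopping the run. Note that map_df() needs dplyr installed for the row binding, and that from purrr 1.0.0 on it is marked as superseded in favour of map() followed by list_rbind().

```r
library(purrr)  # map_df() also relies on dplyr being installed

# Hypothetical stand-in for the scraping body; id 2 fails on purpose.
scrape_one <- function(id) {
  if (id == 2) stop("simulated download failure")
  data.frame(id = id, carb = c("fuel_a", "fuel_b"), stringsAsFactors = FALSE)
}

ids <- 1:4  # stands in for Codes$id

# possibly() wraps the function so that an error returns NULL;
# map_df() then row-binds the results and drops the NULL entries.
safe_scrape <- possibly(scrape_one, otherwise = NULL)
result <- map_df(ids, safe_scrape)

nrow(result)  # 6 rows: 3 successful ids x 2 rows each
```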
Upvotes: 2