SoloSpirit

Reputation: 87

Loading thousands of URLs for ML-related web scraping - code is VERY slow, need efficiency tips

I am building a dataset by web-scraping data from various websites for a stock signal prediction algorithm. The way my algorithm is set up involves layering for-loops and loading thousands of URLs, because each link refers to a stock and its various quantitative statistics. I need help increasing processing speed. Any tips?

I have talked to a few people about how to solve this, and some have recommended vectorization, but that is new to me. I have also tried switching to data.table, but I haven't seen much change. The eval lines are a trick I learned to manipulate the data the way I want; I wondered whether they might be part of why it is slow, but I doubt it (there is a stripped-down sketch of what they boil down to after the code below). I have also wondered about remote processing, but that probably goes beyond the R world.

For the code below, imagine there are 4 more sections like this for other variables from different websites I want to load, and all of these blocks sit inside an even larger for-loop because I'm generating two datasets (set = c("training", "testing")).

The tryCatch is there to keep the code from stopping if it encounters an error loading a URL. The URLs are loaded into a list, one per stock, so the lists are pretty long. The second for-loop scrapes the data from the URLs and writes it, formatted correctly, into a data frame.

library(quantmod)
library(readr)
library(rvest)
library(data.table)

urlsmacd <- vector("list", length =
  eval(parse(text = as.name(paste0("nrow(", set[y], ")", sep = "")))))

for (h in 1:eval(parse(text = as.name(paste0("nrow(", set[y], ")", sep = ""))))) {
  urlsmacd[h] <- paste0('http://www.stockta.com/cgi-bin/analysis.pl?symb=',
                        eval(parse(text = as.name(paste0(set[y], "[,1][h]", sep = "")))),
                        '&mode=table&table=macd&num1=1', sep = '')
}

for (j in 1:eval(parse(text = as.name(paste0("nrow(", set[y], ")", sep = ""))))) {
  tryCatch({
    html <- read_html(urlsmacd[[j]])

    # get MACD(26) value from the html
    MACD26 <- html_nodes(html, '.borderTd~ .borderTd+ .borderTd:nth-child(3) font')
    MACD26 <- toString(MACD26)
    MACD26 <- gsub("<[^>]+>", "", MACD26)
    if (!is.na(MACD26)) {
      MACD26 <- as.double(MACD26)
    }
    eval(parse(text = as.name(paste0(set[y], "$", "MACD26[j] <- MACD26"))))

    # get MACD(12) value from the html
    MACD12 <- html_nodes(html, '.borderTd+ .borderTd:nth-child(2) font')
    MACD12 <- toString(MACD12)
    MACD12 <- gsub("<[^>]+>", "", MACD12)
    if (!is.na(MACD12)) {
      MACD12 <- as.double(MACD12)
    }
    eval(parse(text = as.name(paste0(set[y], "$", "MACD12[j] <- MACD12"))))

  }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
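
To show what the eval lines expand to, here is a stripped-down sketch of one block (not my real code). It assumes set[y] names a data frame that exists in the environment, so get(set[y]) returns it; df and the direct $MACD26[j] assignment are just stand-ins for what the eval/parse calls are doing:

df <- get(set[y])   # the data frame that set[y] refers to, e.g. the training set

# build all the URLs at once instead of one per loop iteration
urlsmacd <- paste0('http://www.stockta.com/cgi-bin/analysis.pl?symb=',
                   df[, 1], '&mode=table&table=macd&num1=1')

for (j in seq_along(urlsmacd)) {
  tryCatch({
    html <- read_html(urlsmacd[j])
    MACD26 <- gsub("<[^>]+>", "",
                   toString(html_nodes(html, '.borderTd~ .borderTd+ .borderTd:nth-child(3) font')))
    df$MACD26[j] <- as.double(MACD26)
  }, error = function(e) cat("ERROR :", conditionMessage(e), "\n"))
}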

All said and done, this process takes around 6 hours. At this rate, shaving hours off it would make progressing with my project so much easier.

Thank you, people of Stack Overflow, for your support.

Upvotes: 1

Views: 82

Answers (1)

Santiago Capobianco

Reputation: 871

Check out the doParallel package. It provides a parallel implementation of the foreach loop, which lets you use more of your CPU's cores (if they are available) to run parallel R sessions for a given function. For example:

library(doParallel)

no_cores <- detectCores() - 1                # leave one core free for the OS
cl <- makeCluster(no_cores, type = "FORK")   # FORK clusters only work on Linux/macOS
registerDoParallel(cl)

# getPrimeNumbers() comes from the post linked below
result <- foreach(i = 10:10000) %dopar% getPrimeNumbers(i)

stopCluster(cl)

If the URLs are stored in a list, there is also a parallel version of lapply (parLapply in the parallel package).
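
For instance, something along these lines could work for your URL list (a rough sketch, not tested against your site: scrape_one() is a placeholder for a function that wraps your read_html/html_nodes logic for a single URL and returns the value you need):

library(parallel)
library(rvest)

# placeholder: scrape one URL and return the extracted value, or NA on error
scrape_one <- function(url) {
  tryCatch({
    html <- read_html(url)
    txt  <- gsub("<[^>]+>", "",
                 toString(html_nodes(html, '.borderTd~ .borderTd+ .borderTd:nth-child(3) font')))
    as.double(txt)
  }, error = function(e) NA_real_)
}

cl <- makeCluster(detectCores() - 1)           # PSOCK cluster, works on any OS
clusterEvalQ(cl, library(rvest))               # load rvest on each worker
results <- parLapply(cl, urlsmacd, scrape_one) # one result per URL, same order
stopCluster(cl)

You would then write the results back into your data frame in one step after the parallel call, instead of assigning inside the loop.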

The foreach example is taken from this great post:

https://www.r-bloggers.com/lets-be-faster-and-more-parallel-in-r-with-doparallel-package/amp/

Hope it helps.

Upvotes: 1
