Reputation: 87
I am building a dataset by web scraping data from various websites for a stock signal prediction algorithm. The way my algorithm is set up involves layering for-loops and loading thousands of URLs, because each link refers to a stock and its various quantitative statistics. I need help increasing processing speed. Any tips?
I have talked to a few different people about how to solve this, and some have recommended vectorization, but that is new to me. I have also tried switching to data.table, but I haven't seen much change. The eval(parse()) lines are a trick I learned to manipulate the data the way I want; I figure they may be a reason why it is slow, but I doubt it. I have also wondered about remote processing, but that probably goes beyond the R world.
For the code below, imagine there are 4 more sections like this for other variables from different websites I want to load, and all of these blocks sit inside an even larger for-loop because I'm generating two datasets (set <- c("training", "testing")).
The tryCatch is there to prevent the code from stopping if it encounters an error loading a URL. The URLs are loaded into a list, one per stock, so the lists are pretty long. The second for-loop scrapes the data from the URLs and writes it, correctly formatted, into a data frame.
library(quantmod)
library(readr)
library(rvest)
library(data.table)
urlsmacd <- vector("list", length = eval(parse(text = paste0("nrow(", set[y], ")"))))

# Build one URL per stock symbol in the current data set
for (h in 1:eval(parse(text = paste0("nrow(", set[y], ")")))) {
  urlsmacd[[h]] <- paste0('http://www.stockta.com/cgi-bin/analysis.pl?symb=',
                          eval(parse(text = paste0(set[y], "[,1][h]"))),
                          '&mode=table&table=macd&num1=1')
}

# Scrape the MACD values from each URL and write them into the data frame
for (j in 1:eval(parse(text = paste0("nrow(", set[y], ")")))) {
  tryCatch({
    html <- read_html(urlsmacd[[j]])

    # get MACD(26) from the html
    MACD26 <- html_nodes(html, '.borderTd~ .borderTd+ .borderTd:nth-child(3) font')
    MACD26 <- toString(MACD26)
    MACD26 <- gsub("<[^>]+>", "", MACD26)
    if (!is.na(MACD26)) {
      MACD26 <- as.double(MACD26)
    }
    eval(parse(text = paste0(set[y], "$MACD26[j] <- MACD26")))

    # get MACD(12) from the html
    MACD12 <- html_nodes(html, '.borderTd+ .borderTd:nth-child(2) font')
    MACD12 <- toString(MACD12)
    MACD12 <- gsub("<[^>]+>", "", MACD12)
    if (!is.na(MACD12)) {
      MACD12 <- as.double(MACD12)
    }
    eval(parse(text = paste0(set[y], "$MACD12[j] <- MACD12")))
  }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
All said and done, this process takes around 6 hours. At this rate, shaving hours off would make progressing my project so much easier.
Thank you, people of Stack Overflow, for your support.
Upvotes: 1
Views: 82
Reputation: 871
Check out the doParallel package. It provides a parallel backend for the foreach loop, letting you use additional CPU cores (if available) to run parallel R sessions over a defined function. For example:
library(doParallel)

# use all available cores except one
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores, type = "FORK")  # FORK clusters only work on Unix-alikes, not Windows
registerDoParallel(cl)

result <- foreach(i = 10:10000) %dopar%
  getPrimeNumbers(i)

stopCluster(cl)  # release the workers when done
If the URLs are stored in a list, there is also a parallel lapply (parLapply); a sketch adapting it to your scraping loop follows below.
The foreach example is taken from this great post:
https://www.r-bloggers.com/lets-be-faster-and-more-parallel-in-r-with-doparallel-package/amp/
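To adapt that idea to your case, here is a minimal sketch using parLapply from the parallel package, assuming urlsmacd has already been built as in your code and that your CSS selectors are correct; the scrape_macd helper is hypothetical and only illustrates the pattern:

library(rvest)
library(parallel)

# Hypothetical helper: scrape one URL and return both MACD values (NAs on error)
scrape_macd <- function(url) {
  tryCatch({
    html <- read_html(url)
    macd26 <- as.double(gsub("<[^>]+>", "",
      toString(html_nodes(html, '.borderTd~ .borderTd+ .borderTd:nth-child(3) font'))))
    macd12 <- as.double(gsub("<[^>]+>", "",
      toString(html_nodes(html, '.borderTd+ .borderTd:nth-child(2) font'))))
    c(MACD26 = macd26, MACD12 = macd12)
  }, error = function(e) c(MACD26 = NA_real_, MACD12 = NA_real_))
}

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(rvest))   # each worker needs rvest loaded
results <- parLapply(cl, urlsmacd, scrape_macd)
stopCluster(cl)

macd <- do.call(rbind, results)    # one row per stock, columns MACD26 and MACD12

On Linux or macOS you could instead call mclapply(urlsmacd, scrape_macd, mc.cores = detectCores() - 1) and skip the cluster setup. Since the bottleneck here is mostly network I/O rather than CPU, even a modest number of workers should cut the 6 hours substantially.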
Hope it helps.
Upvotes: 1