Reputation: 23
I am working on a solo project that starts with gathering stock data, using the rvest package for web scraping and storing the results in a data.table.
The loop below pulls a portion of the stock tickers from a website and stores them in a data frame. My code is extremely archaic (I think), partly because of the way the website is organized. The site lists the symbols alphabetically, one page per letter, with a different number of tickers on each page - yes, I had to count how many were on each page. What I ended up with works, but it runs extremely slowly:
#GET AMEX tickers
library(rvest)

alphabet <- c('A','B','C','D','E','F','G','H','I','J','K',
              'L','M','N','O','P','Q','R','S','T','U','V',
              'W','X','Y','Z')

# number of tickers listed on each letter's page (counted by hand);
# row 1 of each table is a header, so the inner loop starts at 2
lengths <- c(65,96,89,125,161,154,86,62,173,83,26,43,62,51,
             37,126,25,81,149,52,77,74,34,50,8,11)

# one URL per letter (paste0 is vectorized, so no loop is needed here)
amexurls <- paste0("http://findata.co.nz/markets/AMEX/symbols/", alphabet, ".htm")

amexsymbols <- character(0)
iterator <- 0

for (j in 1:26) {
  url <- amexurls[j]
  for (k in 2:lengths[j]) {
    # re-downloads the same page on every pass through the inner loop
    html <- read_html(url)
    # pull the link in the k-th table row and strip the HTML tags
    test <- html_nodes(html, paste0("tr:nth-child(", k, ") a"))
    test <- toString(test)
    test <- gsub("<[^>]+>", "", test)
    # store the ticker at its position in the running symbol vector
    amexsymbols[k - 2 + iterator] <- test
  }
  iterator <- iterator + lengths[j] + 1
}
The for loop above takes over an hour to run. I think it may be mainly because of the many calls to the internet.
I'm trying to get better about understanding vectorization and other tricks to maximize R's efficiency, especially on a big project like this.
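For example, this is roughly what I mean by cutting down the calls to the internet - fetching each letter's page once instead of once per table row (just a sketch, not something I've tested, and the "tr a" selector probably needs tightening):

amexsymbols <- character(0)

for (url in amexurls) {
  html <- read_html(url)              # one request per page instead of one per row
  links <- html_nodes(html, "tr a")   # grab every link in the table in a single call
  amexsymbols <- c(amexsymbols, html_text(links))
}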
Things I've tried/seen:
- I have moved as much as possible out of the body of the loop (the paste0 line specifically).
- Switching from data.frame to data.table.
- In a much older post, user @Gregor (thanks again) showed me that I can take advantage of paste0 being a vectorized function, so building amexurls doesn't need a for loop (shown again just below with R's built-in LETTERS) - but unfortunately this isn't the slow part of the code.
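For completeness, that vectorized construction can also use R's built-in LETTERS constant instead of typing out the alphabet:

# paste0 is vectorized: one call builds all 26 URLs, no loop needed
amexurls <- paste0("http://findata.co.nz/markets/AMEX/symbols/", LETTERS, ".htm")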
This is just a snippet of a much larger web-scraping script. If I can optimize this chunk, I can apply the same ideas to the rest. Any improvements to my code or tips/tricks would be greatly appreciated. Thanks for your time.
Upvotes: 1
Views: 89
Reputation: 3369
I can't test this right now due to firewall restrictions, but I would recommend trying the html_table() function from rvest to collect the data. That would be much more dynamic than manually specifying the number of stocks on each page and looping through each row individually.
library(rvest)

# one URL per letter A-Z
amexurls <- paste0("http://findata.co.nz/markets/AMEX/symbols/", LETTERS, ".htm")

ldf <- list()
iterator <- 0

for (url in amexurls) {
  iterator <- iterator + 1
  html <- read_html(url)
  # parse the second <table> on the page (the symbol listing) into a data frame
  ldf[[iterator]] <- html_table(html_nodes(html, "table")[[2]])
}

# stack the 26 per-letter tables into one data frame
df <- do.call(rbind, ldf)
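Since the question mentions moving to data.table, one possible variation (also untested here) is to bind the per-letter tables with data.table::rbindlist(), which returns a data.table directly - assuming the columns match across pages:

library(data.table)

# same stacking step as do.call(rbind, ldf), but the result is a data.table
dt <- rbindlist(ldf)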
Upvotes: 1