L3A

Reputation: 25

Increase execution speed for web scraping with for loops

The following code scrapes a travel company's site: for each origin and destination pair, it retrieves the bus lines that make the trip. The idea is to put together a table with all the routes and the companies that operate them. The code itself works and fulfills its goal. However, I need to perform this search for a list of about 650 cities. I left the code running for about 3 hours and it got through less than 5% of the cities. Any suggestions on how I can improve the performance of this code?

library(RCurl)
library(jsonlite)
library(rjson)
library(dplyr)


#LOAD LIST OF CITIES
cidades <- jsonlite::fromJSON("Listas Cidades Artesp.json")
empresasXlinhas <- data.frame()

#RCURL PARAMETERS
headers = c(
  "Accept" = "application/json, text/javascript, */*; q=0.01",
  "Accept-Language" = "en-US,en;q=0.9",
  "Connection" = "keep-alive",
  "Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
  "Cookie" = "__RequestVerificationToken_L1RyYW5zcG9ydGVDb2xldGl2bw2=tY-yKlWmbZvAJzMHmITkohPiIos5XkjDBwf1ZBfP_bYWdXJMBF2Qw3z_B-LRVo0kXjdnHqDqsbZ04Zij_PM-wAf4DWVKfnQskOhqo4ANSRc1",
  "Origin" = "http://extranet.artesp.sp.gov.br",
  "Referer" = "http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino?fbclid=IwAR3_hZwajHk_iyU085S1LDTqLCOYLHIZ5K825XgPGcB4tMI0EuCJpQNrJHM",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
  "X-Requested-With" = "XMLHttpRequest"
)

#WEB SCRAPING
for (x in cidades$Codigo){
  for (y in cidades$Codigo){
    
    if (x != y){
      params = paste("origem=",x,"&destino=",y,"&__RequestVerificationToken=Z-wXmGOb9pnQbmkfcQXmChT-6uc3YfGjftHwK4HnC9SDCaKmzIafo7AI3lChBY6YDBHdpT_X98mSHGAr_YrTNgKiepKxKraGu7p6PI7dV4g1", sep ="")
      res <- postForm("http://extranet.artesp.sp.gov.br/TransporteColetivo/OrigemDestino/GetGrid", .opts=list(postfields = params, httpheader = headers, followlocation = TRUE), style = "httppost")
      if (res != "[]"){
        print(paste(x,y))
        empresa <- jsonlite::fromJSON(res)[[2]]
        empresa <- empresa %>% mutate(cod_origem = x, cod_destino = y)
        empresasXlinhas <- rbind(empresasXlinhas, empresa)
        
      }
    }
  }
}

Upvotes: 1

Views: 94

Answers (1)

robertdj

Reputation: 1117

I don't have access to the list of cities, so I cannot run the code to check how fast it is.

Some high-level considerations: instead of downloading and aggregating in the same loop, I suggest you first download everything and later read the data. This allows you to stop your script without having to start from scratch and instead pick up where you left off.

If you want to make async calls, consider the {httr2} package. Alternatively, consider rolling your own "download" function and using it with {furrr} like this (please forgive my shortening of the long lines; I think it makes for a clearer read, and it is just copying your code):

First, make a dataframe/tibble of the missing files:

all_input <- tidyr::crossing(
    origin = cidades$Codigo,
    destination = cidades$Codigo
) %>%
    mutate(
        filename = <some reasonable filename>
    ) %>%
    filter(
        origin != destination,
        !fs::file_exists(filename)
    )

The filename column could for example be

fs::path("raw_data", paste(origin, destination, sep = "_"), ext = "csv")

Then make a function that downloads and saves each response:

my_download_func <- function(origin, destination, filename) {
    headers <- c(...)
    params <- paste("origem=", origin, "&destino=", destination, ...)
    res <- postForm(...)
    empresa <- jsonlite::fromJSON(res)[[2]]
    readr::write_csv(empresa, filename)
}

Finally, go through the rows of all_input in parallel with {furrr}:

future::plan(future::multisession)
furrr::future_pwalk(all_input, my_download_func)
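
To complete the "download everything first, read later" idea, the saved files can then be combined in a separate step. A minimal sketch of that step (my addition, assuming the per-pair CSVs ended up in a raw_data/ folder as in the filename example above):

all_files <- fs::dir_ls("raw_data", glob = "*.csv")

# Stack every per-pair file into one table; the .id column keeps the
# source path so origin/destination can be recovered from the filename
empresasXlinhas <- purrr::map_dfr(all_files, readr::read_csv, .id = "filename")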

One final piece of advice is to consider some error handling when parsing res.
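
For instance, one sketch of that (not tested against the real endpoint) is to wrap the fromJSON() call inside my_download_func in tryCatch, so a single empty or malformed response is logged and skipped instead of aborting hours of downloads:

empresa <- tryCatch(
    jsonlite::fromJSON(res)[[2]],
    error = function(e) {
        # Report the failing pair and return NULL instead of stopping the run
        message("Could not parse response for ", origin, " -> ", destination, ": ", conditionMessage(e))
        NULL
    }
)
if (!is.null(empresa)) readr::write_csv(empresa, filename)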

Upvotes: 2
