csmontt

Reputation: 624

Download files when exact URL is not known

From this link, I'm trying to download multiple PDF files, but I can't get the exact URL for each file.

To access one of the PDF files, click on "Región de Arica y Parinacota" and then on "Arica"; the URL is then http://cdn.servel.cl/padronesauditados/padron/A1501001.pdf. If you click on the next link, "Camarones", the URL becomes http://cdn.servel.cl/padronesauditados/padron/A1501002.pdf.

I checked more URLs, and they all have a similar pattern:

"A" + "two digit number from 1 to 15" + "two digit number of unknown range" + "three digit number of unknown range"

Even though the example URLs above suggest that the files are named sequentially, this is not always the case.
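To make the pattern concrete, here is a minimal sketch that expresses it as a regular expression. The helper name `is_candidate` is hypothetical, and the pattern only encodes what is described above (region 01 to 15, then two digits, then three digits); it says nothing about whether a matching file actually exists.

```r
# "A" + region (01-15) + two digits + three digits + ".pdf"
pattern <- "^A(0[1-9]|1[0-5])[0-9]{2}[0-9]{3}\\.pdf$"

# Hypothetical helper: TRUE when a file name fits the observed pattern
is_candidate <- function(x) grepl(pattern, x)

is_candidate(c("A1501001.pdf", "A1501002.pdf", "A1601001.pdf", "B1501001.pdf"))
# [1]  TRUE  TRUE FALSE FALSE
```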

To download all the files despite not knowing the exact URLs, I did the following:

1) I wrote a for loop to generate all possible file names based on the pattern described above, i.e., A0101001.pdf, A0101002.pdf, ..., A1599999.pdf

library(downloader)
library(stringr)
reg.ind <- 1:15
pro.ind <- 1:99
com.ind <- 1:999
reg  <- str_pad(reg.ind, width = 2, side = "left", pad = "0")
prov <- str_pad(pro.ind, width = 2, side = "left", pad = "0")
com  <- str_pad(com.ind, width = 3, side = "left", pad = "0")

file <- c()
for (i in 1:length(reg)) {
    reg.i <- reg[i]
    for (j in 1:length(prov)) {
        prov.j <- prov[j]
        for (k in 1:length(com)) {
            com.k <- com[k]
            file <- c(file, paste0("A", reg.i, prov.j, com.k))
        }
    }
}
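As an aside, the same vector of candidate names can probably be built without growing `file` inside three nested loops. The following is a sketch using only base R (`sprintf` and `expand.grid`); it assumes the same ranges as above and keeps the same ordering, with the commune index varying fastest.

```r
# Zero-padded indices, same ranges as in the loop version
reg  <- sprintf("%02d", 1:15)
prov <- sprintf("%02d", 1:99)
com  <- sprintf("%03d", 1:999)

# expand.grid varies its first argument fastest, so listing com first
# reproduces the order of the nested loops (reg outermost, com innermost)
grid <- expand.grid(com = com, prov = prov, reg = reg, stringsAsFactors = FALSE)
file <- paste0("A", grid$reg, grid$prov, grid$com)

length(file)
# [1] 1483515
```

This only speeds up building the list of names; the number of URLs to try stays the same.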

2) Then I used another for loop to download a file every time I hit a correct URL. I use tryCatch to ignore the cases where the URL is incorrect (most of the time).

for (i in 1:length(file)) {
    tryCatch({
        url <- paste0("http://cdn.servel.cl/padronesauditados/padron/", file[i], ".pdf")
        # change destfile accordingly if you decide to run the code
        download.file(url, destfile = paste0("./datos/comunas/", file[i], ".pdf"),
                      mode = "wb")
    }, error = function(e){})
}

PROBLEM: I know there are no more than 400 PDF files in total, as each one corresponds to a commune in Chile, but I generated a vector with 1483515 possible file names. So my code works, but it takes far longer than it would if I could obtain the actual file names beforehand.

Does anyone know how to work around this problem?

Upvotes: 2

Views: 467

Answers (1)

hrbrmstr

Reputation: 78792

You can re-create the "browser developer tools" experience in R with splashr:

library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(tidyverse)

sp <- start_splash()

Sys.sleep(3) # give the docker container time to work

res <- render_har(url = "http://cdn.servel.cl/padronesauditados/padron.html", 
                  response_body=TRUE)

map_chr(har_entries(res), c("request", "url"))
##  [1] "http://cdn.servel.cl/padronesauditados/padron.html"
##  [2] "http://cdn.servel.cl/padronesauditados/stylesheets/navbar-cleaned.min.css"
##  [3] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue.min.css"
##  [4] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue2.min.css"
##  [5] "http://cdn.servel.cl/padronesauditados/stylesheets/custom.min.css"
##  [6] "https://fonts.googleapis.com/css?family=Lato%3A400%2C700%7CRoboto%3A100%2C300%2C400%2C500%2C700%2C900%2C100italic%2C300italic%2C400italic%2C500italic%2C700italic%2C900italic&ver=1458748651"
##  [7] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.css"
##  [8] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/external/jquery/jquery.js"
##  [9] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.js"
## [10] "http://cdn.servel.cl/padronesauditados/images/logo-txt-retina.png"
## [11] "http://cdn.servel.cl/assets/img/nav_arrows.png"
## [12] "http://cdn.servel.cl/padronesauditados/images/loader.gif"
## [13] "http://cdn.servel.cl/padronesauditados/archivos.xml"
## [14] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/images/ui-icons_444444_256x240.png"
## [15] "https://fonts.gstatic.com/s/roboto/v16/zN7GBFwfMP4uA6AR0HCoLQ.ttf"
## [16] "https://fonts.gstatic.com/s/roboto/v16/RxZJdnzeo3R5zSexge8UUaCWcynf_cDxXwCLxiixG1c.ttf"
## [17] "https://fonts.gstatic.com/s/roboto/v16/Hgo13k-tfSpn0qi1SFdUfaCWcynf_cDxXwCLxiixG1c.ttf"
## [18] "https://fonts.gstatic.com/s/roboto/v16/Jzo62I39jc0gQRrbndN6nfesZW2xOQ-xsNqO47m55DA.ttf"
## [19] "https://fonts.gstatic.com/s/roboto/v16/d-6IYplOFocCacKzxwXSOKCWcynf_cDxXwCLxiixG1c.ttf"
## [20] "https://fonts.gstatic.com/s/roboto/v16/mnpfi9pxYH-Go5UiibESIqCWcynf_cDxXwCLxiixG1c.ttf"
## [21] "http://cdn.servel.cl/padronesauditados/stylesheets/fonts/virtue_icons.woff"
## [22] "https://fonts.gstatic.com/s/lato/v13/v0SdcGFAl2aezM9Vq_aFTQ.ttf"
## [23] "https://fonts.gstatic.com/s/lato/v13/DvlFBScY1r-FMtZSYIYoYw.ttf"

The XML entry (archivos.xml, entry 13) is easy to spot in the list above, so we can focus on it:

har_entries(res)[[13]]$response$content$text %>% 
  openssl::base64_decode() %>% 
  xml2::read_xml() %>% 
  xml2::xml_find_all(".//Region") %>% 
  map_df(~{
    tibble( # data_frame() is deprecated in favour of tibble()
      id = xml2::xml_find_all(.x, ".//id") %>% xml2::xml_text(),
      nombre = xml2::xml_find_all(.x, ".//nombre") %>% xml2::xml_text(),
      nomcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/nomcomuna") %>% xml2::xml_text(),
      id_archivo = xml2::xml_find_all(.x, ".//comunas/comuna/idArchivo") %>% xml2::xml_text(),
      archcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/archcomuna") %>% xml2::xml_text()
    )
  })
## # A tibble: 346 x 5
##       id                       nombre     nomcomuna id_archivo   archcomuna
##    <chr>                        <chr>         <chr>      <chr>        <chr>
##  1     1 Región de Arica y Parinacota         Arica          1 A1501001.pdf
##  2     1 Región de Arica y Parinacota     Camarones          2 A1501002.pdf
##  3     1 Región de Arica y Parinacota General Lagos          3 A1502002.pdf
##  4     1 Región de Arica y Parinacota         Putre          4 A1502001.pdf
##  5     2           Región de Tarapacá Alto Hospicio          5 A0103002.pdf
##  6     2           Región de Tarapacá        Camiña          6 A0152002.pdf
##  7     2           Región de Tarapacá      Colchane          7 A0152003.pdf
##  8     2           Región de Tarapacá         Huara          8 A0152001.pdf
##  9     2           Región de Tarapacá       Iquique          9 A0103001.pdf
## 10     2           Región de Tarapacá          Pica         10 A0152004.pdf
## # ... with 336 more rows

stop_splash(sp) # don't forget to clean up!

You can then programmatically download all the PDFs by prepending the URL prefix http://cdn.servel.cl/padronesauditados/padron/ to each archcomuna value.
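A minimal sketch of that download step, using only base R. The helper names `build_urls` and `download_padrones` and the `padrones` destination directory are hypothetical, and the sketch assumes you have the `archcomuna` column available as a character vector (for example by assigning the tibble above to a variable).

```r
base_url <- "http://cdn.servel.cl/padronesauditados/padron/"

# Hypothetical helper: build full URLs from the archcomuna file names
build_urls <- function(archcomuna, prefix = base_url) {
  paste0(prefix, archcomuna)
}

# Hypothetical helper: download each PDF, returning TRUE/FALSE per file
download_padrones <- function(archcomuna, dest_dir = "padrones") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  urls <- build_urls(archcomuna)
  ok <- vapply(seq_along(urls), function(i) {
    dest <- file.path(dest_dir, archcomuna[i])
    tryCatch(
      # download.file() returns 0 on success
      download.file(urls[i], destfile = dest, mode = "wb", quiet = TRUE) == 0,
      error = function(e) FALSE
    )
  }, logical(1))
  invisible(ok)
}

build_urls(c("A1501001.pdf", "A1501002.pdf"))
# [1] "http://cdn.servel.cl/padronesauditados/padron/A1501001.pdf"
# [2] "http://cdn.servel.cl/padronesauditados/padron/A1501002.pdf"
```

With the tibble stored in a variable, say `comunas` (an assumed name), something like `download_padrones(comunas$archcomuna)` would then fetch only the 346 listed files instead of trying every possible name.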

Upvotes: 1
