Reputation: 624
From this link, I'm trying to download multiple PDF files, but I can't get the exact URL for each file.
To access one of the PDF files, you can click on "Región de Arica y Parinacota" and then on "Arica". The URL is then http://cdn.servel.cl/padronesauditados/padron/A1501001.pdf. If you click on the next link, "Camarones", the URL becomes http://cdn.servel.cl/padronesauditados/padron/A1501002.pdf
I checked more URLs, and they all have a similar pattern:
"A" + "two digit number from 1 to 15" + "two digit number of unknown range" + "three digit number of unknown range"
Even though the example URLs above suggest that the files are named sequentially, this is not always the case.
To download all the files despite not knowing the exact URLs, I did the following:
1) I wrote a for loop to generate every possible file name based on the pattern described above, i.e., A0101001.pdf, A0101002.pdf, ..., A1599999.pdf
library(downloader)
library(stringr)
reg.ind <- 1:15
pro.ind <- 1:99
com.ind <- 1:999
reg <- str_pad(reg.ind, width=2, side="left", pad="0")
prov <- str_pad(pro.ind, width=2, side="left", pad="0")
com <- str_pad(com.ind, width=3, side="left", pad="0")
file <- c()
for(i in 1:length(reg)){
  reg.i <- reg[i]
  for(j in 1:length(prov)){
    prov.j <- prov[j]
    for(k in 1:length(com)){
      com.k <- com[k]
      file <- c(file, (paste0("A", reg.i, prov.j, com.k)))
    }
  }
}
2) Then I used another for loop to download a file every time I hit a correct URL. I use tryCatch to ignore the cases where the URL is incorrect (most of the time).
for(i in 1:length(file)){
  tryCatch({
    url <- paste0("http://cdn.servel.cl/padronesauditados/padron/", file[i],
                  ".pdf")
    # change destfile accordingly if you decide to run the code
    download.file(url, destfile = paste0("./datos/comunas/", file[i], ".pdf"),
                  mode = "wb")
  }, error = function(e){})
}
PROBLEM: I know there are no more than 400 PDF files in total, since each one corresponds to a commune in Chile. However, my vector contains 1483515 possible file names, so my code, even though it works, takes much longer than it would if I could obtain the actual file names beforehand.
Does anyone know how to work around this problem?
Upvotes: 2
Views: 467
Reputation: 78792
You can re-create the "browser developer tools" experience in R with splashr:
library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(tidyverse)
sp <- start_splash()
Sys.sleep(3) # give the docker container time to work
res <- render_har(url = "http://cdn.servel.cl/padronesauditados/padron.html",
response_body=TRUE)
map_chr(har_entries(res), c("request", "url"))
## [1] "http://cdn.servel.cl/padronesauditados/padron.html"
## [2] "http://cdn.servel.cl/padronesauditados/stylesheets/navbar-cleaned.min.css"
## [3] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue.min.css"
## [4] "http://cdn.servel.cl/padronesauditados/stylesheets/virtue2.min.css"
## [5] "http://cdn.servel.cl/padronesauditados/stylesheets/custom.min.css"
## [6] "https://fonts.googleapis.com/css?family=Lato%3A400%2C700%7CRoboto%3A100%2C300%2C400%2C500%2C700%2C900%2C100italic%2C300italic%2C400italic%2C500italic%2C700italic%2C900italic&ver=1458748651"
## [7] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.css"
## [8] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/external/jquery/jquery.js"
## [9] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/jquery-ui.js"
## [10] "http://cdn.servel.cl/padronesauditados/images/logo-txt-retina.png"
## [11] "http://cdn.servel.cl/assets/img/nav_arrows.png"
## [12] "http://cdn.servel.cl/padronesauditados/images/loader.gif"
## [13] "http://cdn.servel.cl/padronesauditados/archivos.xml"
## [14] "http://cdn.servel.cl/padronesauditados/jquery-ui-1.12.1.custom/images/ui-icons_444444_256x240.png"
## [15] "https://fonts.gstatic.com/s/roboto/v16/zN7GBFwfMP4uA6AR0HCoLQ.ttf"
## [16] "https://fonts.gstatic.com/s/roboto/v16/RxZJdnzeo3R5zSexge8UUaCWcynf_cDxXwCLxiixG1c.ttf"
## [17] "https://fonts.gstatic.com/s/roboto/v16/Hgo13k-tfSpn0qi1SFdUfaCWcynf_cDxXwCLxiixG1c.ttf"
## [18] "https://fonts.gstatic.com/s/roboto/v16/Jzo62I39jc0gQRrbndN6nfesZW2xOQ-xsNqO47m55DA.ttf"
## [19] "https://fonts.gstatic.com/s/roboto/v16/d-6IYplOFocCacKzxwXSOKCWcynf_cDxXwCLxiixG1c.ttf"
## [20] "https://fonts.gstatic.com/s/roboto/v16/mnpfi9pxYH-Go5UiibESIqCWcynf_cDxXwCLxiixG1c.ttf"
## [21] "http://cdn.servel.cl/padronesauditados/stylesheets/fonts/virtue_icons.woff"
## [22] "https://fonts.gstatic.com/s/lato/v13/v0SdcGFAl2aezM9Vq_aFTQ.ttf"
## [23] "https://fonts.gstatic.com/s/lato/v13/DvlFBScY1r-FMtZSYIYoYw.ttf"
The XML entry (number 13, archivos.xml) is easy to spot in the output above, so we can focus on it:
har_entries(res)[[13]]$response$content$text %>%
  openssl::base64_decode() %>%
  xml2::read_xml() %>%
  xml2::xml_find_all(".//Region") %>%
  map_df(~{
    data_frame(
      id = xml2::xml_find_all(.x, ".//id") %>% xml2::xml_text(),
      nombre = xml2::xml_find_all(.x, ".//nombre") %>% xml2::xml_text(),
      nomcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/nomcomuna") %>% xml2::xml_text(),
      id_archivo = xml2::xml_find_all(.x, ".//comunas/comuna/idArchivo") %>% xml2::xml_text(),
      archcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/archcomuna") %>% xml2::xml_text()
    )
  })
## # A tibble: 346 x 5
##    id    nombre                       nomcomuna     id_archivo archcomuna
##    <chr> <chr>                        <chr>         <chr>      <chr>
##  1 1     Región de Arica y Parinacota Arica         1          A1501001.pdf
##  2 1     Región de Arica y Parinacota Camarones     2          A1501002.pdf
##  3 1     Región de Arica y Parinacota General Lagos 3          A1502002.pdf
##  4 1     Región de Arica y Parinacota Putre         4          A1502001.pdf
##  5 2     Región de Tarapacá           Alto Hospicio 5          A0103002.pdf
##  6 2     Región de Tarapacá           Camiña        6          A0152002.pdf
##  7 2     Región de Tarapacá           Colchane      7          A0152003.pdf
##  8 2     Región de Tarapacá           Huara         8          A0152001.pdf
##  9 2     Región de Tarapacá           Iquique       9          A0103001.pdf
## 10 2     Región de Tarapacá           Pica          10         A0152004.pdf
## # ... with 336 more rows
stop_splash(sp) # don't forget to clean up!
You can then programmatically download all the PDFs by prepending the URL prefix http://cdn.servel.cl/padronesauditados/padron/ to each value in the archcomuna column.
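A minimal sketch of that last step, assuming the file names from the archcomuna column are available as a character vector. The helper names build_urls and download_all are hypothetical (not part of splashr or any other package), and the destination directory is borrowed from the question, so adjust it as needed:

```r
# URL prefix taken from the question.
base_url <- "http://cdn.servel.cl/padronesauditados/padron/"

# Hypothetical helper: turn file names such as "A1501001.pdf" into full URLs.
build_urls <- function(archivos, prefix = base_url) {
  paste0(prefix, archivos)
}

# Hypothetical helper: download each file, returning TRUE/FALSE per file.
# dest_dir is an assumption; it mirrors the path used in the question.
download_all <- function(archivos, dest_dir = "./datos/comunas",
                         prefix = base_url) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  urls <- build_urls(archivos, prefix)
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    dest <- file.path(dest_dir, archivos[i])
    ok[i] <- tryCatch({
      download.file(urls[i], destfile = dest, mode = "wb", quiet = TRUE)
      TRUE
    },
    warning = function(w) FALSE,  # e.g. a 404 reported as a warning
    error = function(e) FALSE)
    Sys.sleep(0.5)  # short pause between requests to go easy on the server
  }
  invisible(ok)
}

# Two file names copied from the output above, used only to show the URLs.
sample_files <- c("A1501001.pdf", "A1501002.pdf")
sample_urls <- build_urls(sample_files)
```

If the tibble from the pipeline above were assigned to a variable, say comunas (a name not used in the original code), then download_all(comunas$archcomuna) would fetch only the 346 files that actually exist instead of looping over every candidate name.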
Upvotes: 1