Reputation: 624
From this link, I'm trying to download multiple PDF files, but I can't get the exact URL for each file.
To access one of the PDF files, you can click on "Región de Arica y Parinacota" and then on "Arica". You can then check the URL of that file; if you click on the next link, "Camarones", you will notice that the URL is different.
I checked more URLs, and they all have a similar pattern:
"A" + "two-digit number from 1 to 15" + "two-digit number of unknown range" + "three-digit number of unknown range"
Even though the URL examples I showed seem to suggest that the files are named sequentially, this is not always the case.
To download all the files despite not knowing the exact URLs, I did the following:
1) I wrote a for loop to generate every possible file name based on the pattern described above, i.e., A0101001.pdf, A0101002.pdf, ..., A1599999.pdf
library(stringr)  # provides str_pad()

reg.ind <- 1:15
pro.ind <- 1:99
com.ind <- 1:999
reg <- str_pad(reg.ind, width=2, side="left", pad="0")
prov <- str_pad(pro.ind, width=2, side="left", pad="0")
com <- str_pad(com.ind, width=3, side="left", pad="0")
file <- c()
for(i in 1:length(reg)){
  reg.i <- reg[i]
  for(j in 1:length(prov)){
    prov.j <- prov[j]
    for(k in 1:length(com)){
      com.k <- com[k]
      file <- c(file, (paste0("A", reg.i, prov.j, com.k)))
    }
  }
}
2) Then I used another for loop to download a file every time I hit a correct URL. I used tryCatch to ignore the cases where the URL was incorrect (most of the time).
for(i in 1:length(file)){
  tryCatch({
    url <- paste0("", file[i], ".pdf")
    # change destfile accordingly if you decide to run the code
    download.file(url, destfile = paste0("./datos/comunas/", file[i], ".pdf"),
                  mode = "wb")
  }, error = function(e){})
}
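As a side note on step 1, the same candidate list can be built without nested loops, which avoids growing a vector one element at a time. This is only a sketch of an alternative in base R; the names `combos` and `candidates` are made up for illustration, and it does not reduce the number of URLs that have to be tried.

```r
# Every combination of region (1-15), province (1-99) and commune (1-999).
# expand.grid() varies its first argument fastest, so listing com first
# reproduces the order of the nested loops (reg outermost, com innermost).
combos <- expand.grid(com = 1:999, prov = 1:99, reg = 1:15)

# Zero-pad each part and assemble all names in one vectorized call
candidates <- sprintf("A%02d%02d%03d", combos$reg, combos$prov, combos$com)

length(candidates)
```

This yields the same 1483515 names (15 * 99 * 999) as the loop version, in the same order.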
PROBLEM: I know there are no more than 400 PDF files in total, as each one corresponds to a commune in Chile, but I built a vector with 1483515 possible file names. My code works, but it takes far longer than it would if I could obtain the real file names beforehand.
Does anyone know how to work around this problem?
Upvotes: 2
Views: 467
Reputation: 78792
You can re-create the "browser developer tools" experience in R with splashr:
library(splashr) # devtools::install_github("hrbrmstr/splashr")
library(purrr)   # map_chr(), map_df() and the %>% pipe

sp <- start_splash()
Sys.sleep(3) # give the docker container time to work
res <- render_har(url = "", response_body = TRUE)
map_chr(har_entries(res), c("request", "url"))
## [1] ""
## [2] ""
## [3] ""
## [4] ""
## [5] ""
## [6] ""
## [7] ""
## [8] ""
## [9] ""
## [10] ""
## [11] ""
## [12] ""
## [13] ""
## [14] ""
## [15] ""
## [16] ""
## [17] ""
## [18] ""
## [19] ""
## [20] ""
## [21] ""
## [22] ""
## [23] ""
The XML entry is easy to spot in the list above, so we can focus on it:
har_entries(res)[[13]]$response$content$text %>%
  openssl::base64_decode() %>%
  xml2::read_xml() %>%
  xml2::xml_find_all(".//Region") %>%
  map_df(~{
    # one row per commune; the region's id and nombre are recycled
    tibble::tibble(
      id = xml2::xml_find_all(.x, ".//id") %>% xml2::xml_text(),
      nombre = xml2::xml_find_all(.x, ".//nombre") %>% xml2::xml_text(),
      nomcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/nomcomuna") %>% xml2::xml_text(),
      id_archivo = xml2::xml_find_all(.x, ".//comunas/comuna/idArchivo") %>% xml2::xml_text(),
      archcomuna = xml2::xml_find_all(.x, ".//comunas/comuna/archcomuna") %>% xml2::xml_text()
    )
  })
## # A tibble: 346 x 5
##    id    nombre                       nomcomuna     id_archivo archcomuna
##    <chr> <chr>                        <chr>         <chr>      <chr>
##  1 1     Región de Arica y Parinacota Arica         1          A1501001.pdf
##  2 1     Región de Arica y Parinacota Camarones     2          A1501002.pdf
##  3 1     Región de Arica y Parinacota General Lagos 3          A1502002.pdf
##  4 1     Región de Arica y Parinacota Putre         4          A1502001.pdf
##  5 2     Región de Tarapacá           Alto Hospicio 5          A0103002.pdf
##  6 2     Región de Tarapacá           Camiña        6          A0152002.pdf
##  7 2     Región de Tarapacá           Colchane      7          A0152003.pdf
##  8 2     Región de Tarapacá           Huara         8          A0152001.pdf
##  9 2     Región de Tarapacá           Iquique       9          A0103001.pdf
## 10 2     Región de Tarapacá           Pica          10         A0152004.pdf
## # ... with 336 more rows
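The position of the XML entry (13 in the run above) may change between sessions, so it could be safer to look it up than to hard-code it. The following is a minimal sketch, assuming the XML resource has a URL whose path ends in ".xml"; the helper `find_xml_entries()` and the example URLs are hypothetical and only illustrate the idea.

```r
# Return the positions of request URLs that look like XML resources.
# Assumption: the path ends in ".xml", optionally followed by a query string.
find_xml_entries <- function(urls) {
  grep("\\.xml(\\?.*)?$", urls, ignore.case = TRUE)
}

# Made-up URLs, for illustration only
example_urls <- c(
  "http://example.com/index.html",
  "http://example.com/js/app.js",
  "http://example.com/data/regiones.xml",
  "http://example.com/css/site.css"
)

xml_idx <- find_xml_entries(example_urls)
```

In a live session, the character vector returned by `map_chr(har_entries(res), c("request", "url"))` would be passed in place of `example_urls`, and the resulting index used instead of the literal 13.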
stop_splash(sp) # don't forget to clean up!
You can then programmatically download all the PDFs by prepending the URL prefix to each value in the archcomuna column.
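A minimal sketch of that download step follows. The helper names `build_download_plan()` and `download_all()` are hypothetical, and `base_url` is a placeholder: the real URL prefix is not reproduced here and has to be filled in. The file names would come from the archcomuna column of the tibble shown earlier.

```r
# Build a table of source URLs and destination paths, one row per PDF.
build_download_plan <- function(archcomuna, base_url, dest_dir) {
  data.frame(
    url = paste0(base_url, archcomuna),
    destfile = file.path(dest_dir, archcomuna),
    stringsAsFactors = FALSE
  )
}

# Download every file in the plan, skipping files that already exist.
# Returns a logical vector: TRUE where the file exists or the download
# raised no error.
download_all <- function(plan, pause = 1) {
  ok <- logical(nrow(plan))
  for (i in seq_len(nrow(plan))) {
    if (file.exists(plan$destfile[i])) {
      ok[i] <- TRUE
      next
    }
    ok[i] <- tryCatch({
      download.file(plan$url[i], destfile = plan$destfile[i], mode = "wb")
      TRUE
    }, error = function(e) FALSE)
    Sys.sleep(pause)  # short pause between requests to be polite to the server
  }
  ok
}

plan <- build_download_plan(
  archcomuna = c("A1501001.pdf", "A1501002.pdf"),  # e.g. the archcomuna column
  base_url = "http://example.com/pdfs/",           # placeholder, not the real prefix
  dest_dir = "./datos/comunas"
)
# download_all(plan)  # run once base_url points at the real prefix
```

With 346 known file names this makes 346 requests instead of roughly 1.5 million guesses.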
Upvotes: 1