Reputation: 2513
I am downloading xls files from this page by looping over the URLs with R, based on this helper as a first step:
getURLFilename <- function(url) {
  require(stringi)
  # Collapse the response headers into a single string
  hdr <- paste(curlGetHeaders(url), collapse = '')
  # Extract the name between filename=" and " in the Content-Disposition header
  fname <- as.vector(stri_match(hdr, regex = '(?<=filename=\\").*(?=\\")'))
  fname
}
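For illustration (not part of the question), here is how that regular expression pulls the filename out of a Content-Disposition header; the sample header string below is made up:

```r
library(stringi)

# A made-up response header of the kind curlGetHeaders() would return.
hdr <- 'HTTP/1.1 200 OK
Content-Type: application/vnd.ms-excel
Content-Disposition: attachment; filename="docannexe.xls"'

# The lookbehind/lookahead capture everything between filename=" and ".
fname <- as.vector(stri_match(hdr, regex = '(?<=filename=\\").*(?=\\")'))
fname
# → "docannexe.xls"
```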
for (i in 8:56) {
  i1 <- sprintf('%02d', i)
  url <- paste0("https://journals.openedition.org/acrh/29", i1, "?file=1")
  file <- paste0("myExcel_", i, ".xls")
  if (!file.exists(file)) download.file(url, file)
}
The files are downloaded but corrupted.
Upvotes: 0
Views: 155
Reputation: 24272
You should use mode = "wb" in download.file to write the file in binary mode.
library(readxl)

for (i in 8:55) {
  i1 <- sprintf('%02d', i)
  url <- paste0("https://journals.openedition.org/acrh/29", i1, "?file=1")
  # format_from_signature() sniffs the file's leading magic bytes
  # and returns NA for files that are not Excel files
  if (is.na(format_from_signature(url))) {
    file <- paste0("myPdf_", i, ".pdf")
  } else {
    file <- paste0("myExcel_", i, ".xls")
  }
  if (!file.exists(file)) download.file(url, file, mode = "wb")
}
Now the downloaded Excel files are not corrupted.
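As a side note (a minimal sketch, not part of the original answer), you can verify the result yourself by inspecting each file's leading magic bytes, which is essentially what format_from_signature() does: legacy .xls files start with the OLE2 signature D0 CF 11 E0, while PDFs start with %PDF.

```r
# Check whether a file on disk starts with the OLE2 (legacy .xls) signature.
is_ole2 <- function(path) {
  sig <- readBin(path, "raw", n = 4)
  identical(sig, as.raw(c(0xD0, 0xCF, 0x11, 0xE0)))
}

# Check whether a file starts with the PDF signature "%PDF".
is_pdf <- function(path) {
  identical(readBin(path, "raw", n = 4), charToRaw("%PDF"))
}

# Quick self-check with a made-up PDF header written to a temp file:
tmp <- tempfile()
writeBin(charToRaw("%PDF-1.4 ..."), tmp)
is_pdf(tmp)   # TRUE
is_ole2(tmp)  # FALSE
```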
Upvotes: 2
Reputation: 389325
Here is a slightly different approach using rvest
to scrape the URLs to download and the filenames to save to, keeping only the XLS files and not the PDFs.
library(rvest)

url <- "https://journals.openedition.org/acrh/2906"

# Scrape the nodes we are interested in
target_nodes <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="annexes"]') %>%
  html_nodes("a")

# Get the indices of the links whose text ends with "xls"
inds <- target_nodes %>% html_text() %>% grep("xls$", .)

# Get the corresponding URLs for the xls files and add the site prefix
target_urls <- target_nodes %>%
  html_attr("href") %>% .[inds] %>%
  paste0("https://journals.openedition.org/acrh/", .)

# Build the target file names to save under
target_name <- target_nodes %>%
  html_text() %>%
  grep("xls$", ., value = TRUE) %>%
  gsub("\\s+", ".", .) %>%
  paste0("/folder_path/to/storefiles/", .)

# Download the files in binary mode and store them at the target_name locations
mapply(download.file, target_urls, target_name, MoreArgs = list(mode = "wb"))
I manually verified 3-4 files on my system: I am able to open them, and the data match what I get when I download them manually from the URL.
Upvotes: 2