Wilcar

Reputation: 2513

Downloading xls files with a loop through URLs gives me corrupted files

I am downloading xls files from this page by looping through URLs with R (based on this first step):

getURLFilename <- function(url){
  # Parse the Content-Disposition header to recover the server-reported filename
  require(stringi)
  hdr <- paste(curlGetHeaders(url), collapse = '')
  fname <- as.vector(stri_match(hdr, regex = '(?<=filename=\\").*(?=\\")'))
  fname
}
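For illustration, the helper can be called on one of the URLs from the loop below to recover the server-reported filename (a minimal sketch; the returned name is whatever the server advertises, or NA if no Content-Disposition header is present):

# Example call (assumes i = 8 from the loop below); returns the advertised filename or NA
getURLFilename("https://journals.openedition.org/acrh/2908?file=1")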


for (i in 8:56) {
  i1 <- sprintf('%02d', i)
  url <- paste0("https://journals.openedition.org/acrh/29", i1, "?file=1")
  file <- paste0("myExcel_", i, ".xls")
  if (!file.exists(file)) download.file(url, file)
}

The files are downloaded but corrupted.

Upvotes: 0

Views: 155

Answers (2)

Marco Sandri

Reputation: 24272

You should use mode="wb" in download.file to write the file in binary mode.

library(readxl)
for (i in 8:55) {
  i1 <- sprintf('%02d', i)
  url <- paste0("https://journals.openedition.org/acrh/29", i1, "?file=1")
  # Some of the linked documents are PDFs rather than Excel files;
  # format_from_signature() returns NA when the content is not an Excel format
  if (is.na(format_from_signature(url))) {
    file <- paste0("myPdf_", i, ".pdf")
  } else {
    file <- paste0("myExcel_", i, ".xls")
  }
  # mode = "wb" writes the download in binary mode and prevents the corruption
  if (!file.exists(file)) download.file(url, file, mode = "wb")
}

Now the downloaded Excel files are not corrupted.
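As a quick sanity check (assuming the file names produced by the loop above), a re-downloaded workbook should now open without error:

# Spot-check one downloaded file; read_excel() errors out on a corrupted download
readxl::read_excel("myExcel_8.xls")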

Upvotes: 2

Ronak Shah

Reputation: 389325

Here is a slightly different approach using rvest to scrape the URLs to download and the filenames to save, keeping only the XLS files and not the PDFs.

library(rvest)
url <- "https://journals.openedition.org/acrh/2906"

#Scrape the nodes we are interested in
target_nodes <- url %>%
                  read_html() %>%
                  html_nodes(xpath = '//*[@id="annexes"]') %>%
                  html_nodes("a")

#Get the indices of the links whose text ends with xls
inds <- target_nodes %>% html_text() %>% grep("xls$", .)

#Get the corresponding URLs for the xls files and prepend the site prefix
target_urls <- target_nodes %>% 
                    html_attr("href") %>% .[inds] %>% 
                    paste0("https://journals.openedition.org/acrh/", .)

#Get the target name to save file
target_name <- target_nodes %>% 
                    html_text() %>% 
                    grep("xls$", ., value = TRUE) %>% 
                    sub("\\s+", ".", .) %>% 
                    paste0("/folder_path/to/storefiles/", .)

#Download the files and store them at target_name location
mapply(download.file, target_urls, target_name)

I manually verified 3-4 files on my system; I am able to open them, and the data match what I get when I download the files manually from the URL.
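One caveat: download.file defaults to text mode, so on Windows the binary-mode fix from the first answer may also be needed here; a hedged variant of the final call:

# Pass mode = "wb" to every download.file() call through MoreArgs
mapply(download.file, target_urls, target_name, MoreArgs = list(mode = "wb"))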

Upvotes: 2
