ponyhd

Reputation: 521

batch download zipped files in R

I am trying to download zipped files from a website like http://cdo.ncdc.noaa.gov/qclcd_ascii/. Since there are many files, is there a way to download them in batch instead of one by one? Ideally, the downloaded files could also be unzipped in batch after downloading. I tried to use system("curl http://cdo.ncdc.noaa.gov/qclcd_ascii/QCLCD") etc., but got many errors and status 127 warnings.

Any idea or suggestions?

Thanks!

Upvotes: 2

Views: 1909

Answers (4)

Andreas

Reputation: 1953

Here's my take on it:

### Load XML package, for 'htmlParse'
require(XML)

### Read in HTML contents, extract file names.
root <- 'http://cdo.ncdc.noaa.gov/qclcd_ascii/'
doc  <- htmlParse(root)
fnames <- xpathSApply(doc, '//a[@href]', xmlValue)

### Keep only zip files, and create url paths to scrape.
fnames <- grep('zip$', fnames, value = TRUE)
paths  <- paste0(root, fnames)

Now that you have a vector of URLs and the corresponding file names in R, you can download them to your hard disk. You have two options: download in serial, or in parallel.

### Download data in serial, saving to the current working directory.
mapply(download.file, url = paths, destfile = fnames)
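
One caveat: in older versions of R on Windows, download.file() defaults to text mode, which can corrupt binary files such as zip archives, so it is safer to pass mode = 'wb' explicitly. A variant of the serial call (reusing paths and fnames from above):

### Same serial download, but forcing binary mode for the zip archives.
mapply(download.file, url = paths, destfile = fnames,
       MoreArgs = list(mode = 'wb'))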

### Download data in parallel, also saving to the current working directory.
require(parallel)
cl <- makeCluster(detectCores())
clusterMap(cl, download.file, url = paths, destfile = fnames,
           .scheduling = 'dynamic')
stopCluster(cl)

If you choose to download in parallel, I recommend 'dynamic' scheduling, which means that each core won't have to wait for the others to finish before starting its next download. The downside of dynamic scheduling is added communication overhead, but since downloading ~50 MB files is not very resource intensive, the option is worth it as long as the files download at slightly varying speeds.

Lastly, if you also want to include the gzipped tar files, change the regular expression to

fnames <- grep('(zip|gz)$', fnames, value = TRUE)
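
Since the question also asks about unzipping in batch, here is a minimal sketch using base R's unzip(); it assumes the archives were saved to the working directory as above (the gzipped tar files would need a different tool, e.g. untar() or R.utils::gunzip()):

### Batch-unzip the downloaded .zip files into a subdirectory.
zips <- grep('zip$', fnames, value = TRUE)
invisible(lapply(zips, unzip, exdir = 'unzipped'))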

Upvotes: 1

mgriebe

Reputation: 908

This should work.

library(XML)
url <- "http://cdo.ncdc.noaa.gov/qclcd_ascii/"
doc <- htmlParse(url)
# Get the <a> nodes.
Anodes <- getNodeSet(doc, "//a")
# Keep only the .zip and .gz links.
files <- grep("\\.zip$|\\.gz$",
              sapply(Anodes, function(Anode) xmlGetAttr(Anode, "href")),
              value = TRUE)
# Build the full URL for each file.
urls <- paste(url, files, sep = "")
# Download each file.
mapply(download.file, urls, files)

Upvotes: 2

Phil Parsons

Reputation: 36

To download everything under that directory, you can do this:

wget -r -e robots=off http://cdo.ncdc.noaa.gov/qclcd_ascii/
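
If you'd rather stay inside R (the question attempted something similar with curl), the same command can be wrapped in system(); this sketch assumes wget is installed and on your PATH:

# Run the same wget command from within R (assumes wget is on the PATH).
system("wget -r -e robots=off http://cdo.ncdc.noaa.gov/qclcd_ascii/")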

Upvotes: 0

C. Bergey

Reputation: 11

It's not R, but you could easily use the program wget, ignoring robots.txt:

wget -r --no-parent -e robots=off --accept "*.gz" http://cdo.ncdc.noaa.gov/qclcd_ascii/

Upvotes: 1
