Reputation: 521
I am trying to download zipped files from a website like http://cdo.ncdc.noaa.gov/qclcd_ascii/. Since there are many files, is there a way to download them in batch instead of one by one? Ideally, the downloaded files could also be unzipped in batch after downloading. I tried system("curl http://cdo.ncdc.noaa.gov/qclcd_ascii/QCLCD") etc., but got many errors and status 127 warnings.
Any ideas or suggestions?
Thanks!
Upvotes: 2
Views: 1909
Reputation: 1953
Here's my take on it:
### Load XML package, for 'htmlParse'
require(XML)
### Read in HTML contents, extract file names.
root <- 'http://cdo.ncdc.noaa.gov/qclcd_ascii/'
doc <- htmlParse(root)
fnames <- xpathSApply(doc, '//a[@href]', xmlValue)
### Keep only zip files, and create url paths to scrape.
fnames <- grep('zip$', fnames, value = TRUE)
paths <- paste0(root, fnames)
Now that you have a vector of URLs and the corresponding file names in R, you can download them to your hard disk. You have two options: download in serial, or in parallel.
### Download data in serial, saving to the current working directory.
mapply(download.file, url = paths, destfile = fnames)
### Download data in parallel, also saving to current working directory.
require(parallel)
cl <- makeCluster(detectCores())
clusterMap(cl, download.file, url = paths, destfile = fnames,
           .scheduling = 'dynamic')
stopCluster(cl)
If you choose to download in parallel, I recommend 'dynamic' scheduling, which means that each core won't have to wait for the others to finish before starting its next download. The downside of dynamic scheduling is the added communication overhead, but since downloading ~50 MB files is not very resource intensive, it is worth using as long as the files download at slightly varying speeds.
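Whichever route you take, download.file() returns 0 on success, and it is worth a quick check that everything actually landed on disk. A minimal sketch, assuming the files were saved to the working directory as above:
### Sanity check: every file should exist and be non-empty.
ok <- file.exists(fnames) & file.info(fnames)$size > 0
fnames[!ok]   # anything listed here failed and can be re-downloaded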
Lastly, if you also want to include the tar files, change the regular expression to
fnames <- grep('(zip|gz)$', fnames, value = TRUE)
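Since the question also asked about unzipping in batch: once the downloads finish, something along these lines should do it. This is only a sketch; it assumes the archives sit in the working directory, that the gz files are really tar.gz archives, and 'extracted' is just a folder name I picked:
### Batch-extract the downloaded archives into a subfolder.
zips <- grep('zip$', fnames, value = TRUE)
tars <- grep('gz$', fnames, value = TRUE)   # assumes these are tar.gz
lapply(zips, unzip, exdir = 'extracted')
lapply(tars, untar, exdir = 'extracted')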
Upvotes: 1
Reputation: 908
This should work.
library(XML)
url <- "http://cdo.ncdc.noaa.gov/qclcd_ascii/"
doc <- htmlParse(url)
#get the <a> nodes.
Anodes <- getNodeSet(doc, "//a")
#keep the hrefs that end in .zip or .gz
files <- grep("\\.(gz|zip)$",
              sapply(Anodes, function(Anode) xmlGetAttr(Anode, "href")),
              value = TRUE)
#make the full urls
urls <- paste(url, files, sep = "")
#Download each file.
mapply(function(x, y) download.file(x, y), urls, files)
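One caveat: on Windows, download.file() distinguishes text and binary transfers. It usually guesses binary mode from extensions like .zip and .gz, but being explicit with mode = "wb" costs nothing; a small variation on the last call:
#Force binary mode so the archives aren't corrupted on Windows.
mapply(function(x, y) download.file(x, y, mode = "wb"), urls, files)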
Upvotes: 2
Reputation: 36
To download everything under that directory, you can do this:
wget -r -e robots=off http://cdo.ncdc.noaa.gov/qclcd_ascii/
Upvotes: 0
Reputation: 11
It's not R, but you could easily use the program wget, ignoring robots.txt:
wget -r --no-parent -e robots=off --accept '*.gz' http://cdo.ncdc.noaa.gov/qclcd_ascii/
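If you'd rather launch it from inside R (the status 127 in the question usually means the shell couldn't find the command), you can wrap the same call in system2() once wget is installed and on your PATH. A rough sketch:
#Call wget from R; assumes wget is installed and on the PATH.
#shQuote() keeps the shell from expanding *.gz before wget sees it.
system2("wget",
        args = c("-r", "--no-parent", "-e", "robots=off",
                 "--accept", shQuote("*.gz"),
                 "http://cdo.ncdc.noaa.gov/qclcd_ascii/"))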
Upvotes: 1