Jack Frost

Reputation: 1

wget: downloading all files in directories/subdirectories

Basically, on a webpage there is a list of directories, and each of these has further subdirectories. The subdirectories contain a number of files, and I want to download, to a single location on my Linux machine, the one file from each subdirectory whose name contains the letter sequence 'RMD'.

E.g., say the main webpage links to directories dir1, dir2, dir3..., and each of these has subdirectories dir1a, dir1b..., dir2a, dir2b... etc. I want to download files of the form:

webpage/dir1/dir1a/file321RMD210
webpage/dir1/dir1b/file951RMD339
...
webpage/dir2/dir2a/file416RMD712
webpage/dir2/dir2b/file712RMD521

The directories/subdirectories are not sequentially numbered as in the above example (that was just me making it simpler to read), so is there a terminal command that will recursively go through each directory and subdirectory and download every file with the letters 'RMD' in its name?

The website in question is: http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/

I hope that's enough information.

Upvotes: 0

Views: 7814

Answers (3)

Gilles Quénot

Reputation: 185690

One solution using saxon-lint:

saxon-lint --html --xpath 'string-join(//a/@href, "^M")' http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/ |
    awk '/SOL/{print "http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/"$0}' |
    while read url; do
        saxon-lint --html --xpath 'string-join(//a/@href, "^M")' "$url" |
            awk -vurl="$url" '/SOL/{print url$0}'
    done |
    while read url2; do
        saxon-lint --html --xpath 'string-join(//a/@href, "^M")' "$url2" |
            awk -vurl2="$url2" '/RME/{print url2$0}'
    done |
    xargs wget

Replace the

"^M"

with a literal Control-M character on Unix (type Ctrl+V then Ctrl+M to enter it), or with \r\n on Windows.
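
If typing the control character by hand is awkward, here is a minimal alternative sketch, assuming a bash shell: bash's ANSI-C quoting $'\r' expands to a literal carriage return (the ^M character), which can then be passed into the XPath expression.

    # assumes bash: $'\r' is a literal carriage return (^M)
    SEP=$'\r'
    saxon-lint --html --xpath "string-join(//a/@href, '$SEP')" \
        http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/

The same substitution applies to the two inner saxon-lint calls in the pipeline above.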

Upvotes: 2

willeM_ Van Onsem

Reputation: 477607

An answer with a lot of remarks:

If the website supports FTP, you are better off using @MichaelBaldry's answer. This answer aims to give a way to do it with wget (but this is less efficient for both server and client).

You can only use the -r flag for this if the website serves directory listings (the -r flag makes wget find links in the fetched web pages and then download those pages as well).

The following method is inefficient for both server and client and can result in a huge load if the pages are generated dynamically. Furthermore, the website you mention specifically asks not to fetch data that way.

wget -e robots=off -r -k -nv -nH -l inf -R jpg,jpeg,gif,png,tif --reject-regex '(.*)\?(.*)' --no-parent 'http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/'

with:

  • wget: the program you are calling;
  • -e robots=off: ignore the site's robots.txt request not to fetch the pages automatically;
  • -r: download recursively;
  • -k: convert links in the downloaded pages so they work locally;
  • -nv: less verbose output;
  • -nH: do not create a separate directory named after the host;
  • -l inf: keep recursing to an unlimited depth;
  • -R jpg,jpeg,gif,png,tif: reject downloading media files (the small images);
  • --reject-regex '(.*)\?(.*)': do not follow or download query pages (the sort links on index pages);
  • --no-parent: prevent wget from fetching links in the parent of the starting URL (for instance the .. link to the parent directory).
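
If you only want the files whose names contain the pattern the question asks for, a hedged variant of the command above uses wget's -A accept glob (note: '*RMD*' is taken from the question; the other answer filters on 'RME', so adjust the pattern to the real file names). wget still fetches the index pages in order to discover links, but removes the HTML that does not match afterwards:

    # keep only files whose names match the glob; index pages are fetched
    # for link discovery and then deleted
    wget -e robots=off -r -nv -nH -l inf -A '*RMD*' \
        --reject-regex '(.*)\?(.*)' --no-parent \
        'http://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/'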


wget downloads the files breadth-first, so you will have to wait a long time before it eventually starts fetching the real data files.


Note that wget has no way of knowing the directory structure on the server side. It only finds links in the fetched pages and, with that knowledge, builds a dump of the "visible" files. It is possible that the webserver does not list all available files, in which case wget will fail to download all of them.

Upvotes: 3

Michael Baldry

Reputation: 2028

I've noticed this site supports the FTP protocol, which is a far more convenient way of reading files and folders. (It's for transferring files, not web pages.)

Get an FTP client (there are lots of them about) and open ftp://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/. You can probably just highlight all the folders in there and hit download.
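
If you would rather script it than click through an FTP client, a sketch that does the same recursive fetch over FTP with wget (the '*RMD*' glob is an assumption about the file names; adjust it if the files actually contain 'RME' as in the other answer):

    # recursive FTP retrieval, keeping only file names that match the glob
    wget -r -l inf -nH --no-parent -A '*RMD*' \
        'ftp://atmos.nmsu.edu/PDS/data/mslrem_1001/DATA/'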

Upvotes: 2
