Roland

Reputation: 449

wget: using wildcards in the middle of the path

I am trying to recursively download .nc files from: https://satdat.ngdc.noaa.gov/sem/goes/data/full/*/*/*/netcdf/*.nc

A target link looks like this one:

https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/netcdf/

and I need to exclude this:

https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/csv/

I do not understand how to use wildcards to define the path in wget.

Also, the following command (a test for year 1981 only) only downloads subfolders 10, 11 and 12, and fails for subfolders {01..09}:

for i in {01..12};do wget -r -nH -np -x --force-directories -e robots=off https://satdat.ngdc.noaa.gov/sem/goes/data/full/1981/${i}/goes02/netcdf/; done

Upvotes: 1

Views: 484

Answers (1)

Daweo

Reputation: 36360

I do not understand how to use wildcards to define the path in wget.

According to the GNU Wget manual:

File name wildcard matching and recursive mirroring of directories are available when retrieving via FTP.

so wildcards in the URL are not supported when working with an HTTP or HTTPS server.

You might combine -r with --accept-regex urlregex, which lets you

Specify a regular expression to accept (...) the complete URL.

Observe that the regular expression has to match the whole URL. For example, if I wished to download the pages linked from the GNU Package blurbs whose path contains auto, I could do that with

wget -r --level=1 --accept-regex '.*auto.*' https://www.gnu.org/manual/blurbs.html

which results in downloading the main pages of autoconf, autoconf-archive, autogen and automake. Note: --level=1 is used to prevent wget from following links any deeper than those shown on the blurbs page.
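Applied to the URLs in the question, a sketch might look like the following. This is an assumption, not tested against the live NOAA server: the regex must also accept URLs ending in / so that wget can recurse into directory listings, and wget's default --regex-type is posix, so grep -E can be used to dry-run the pattern locally.

```shell
#!/bin/sh
# Hypothetical command (not run here) -- accept .nc files under netcdf/
# directories plus directory listings, starting from the year level:
#   wget -r -np -nH -e robots=off \
#        --accept-regex '.*(/netcdf/.*\.nc|/)$' \
#        https://satdat.ngdc.noaa.gov/sem/goes/data/full/

# Dry-run the regex itself with grep -E against the two example URLs:
re='.*(/netcdf/.*\.nc|/)$'
url_nc='https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/netcdf/a.nc'
url_csv='https://satdat.ngdc.noaa.gov/sem/goes/data/full/1992/11/goes07/csv/a.csv'

echo "$url_nc"  | grep -Eq "$re" && echo "netcdf URL: accepted"
echo "$url_csv" | grep -Eq "$re" || echo "csv URL: rejected"
```

Note that the csv/ directory listing itself (ending in /) would still be fetched during recursion; only the files under it are filtered out, so adding --reject-regex '/csv/' would skip that subtree entirely.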

Upvotes: 3

Related Questions