Reputation: 3004
I would like to download this page:
https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset
as well as its subpages, especially the .pdf documents:
https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset/MS-A0210_thursday_30_oct.pdf
https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset/MS-A0210_hints_for_w45.pdf
etc.
When I give this command:
$ wget --page-requisites --convert-links --recursive --level=0 --no-check-certificate --no-proxy -E -H -Dnoppa.aalto.fi -k https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset
I get:
$ ls -R
.:
noppa.aalto.fi
./noppa.aalto.fi:
noppa robots.txt
./noppa.aalto.fi/noppa:
kurssi
./noppa.aalto.fi/noppa/kurssi:
ms-a0210
./noppa.aalto.fi/noppa/kurssi/ms-a0210:
viikkoharjoitukset.html
I have tried several wget options, with no luck.
What could be the problem?
Upvotes: 1
Views: 156
Reputation: 10519
By default, wget adheres to a site's robots.txt file, which in this case disallows all access:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
Disallow: /cgi-bin/
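You can confirm this locally: judging by your ls -R output, wget already saved a copy of the file, so something like the following should show the rules above (path assumed from that listing):
$ cat noppa.aalto.fi/robots.txt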
If you add -e robots=off to your command line, wget will ignore the robots.txt file.
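For example, your original command with that option added (untested from here, so treat it as a sketch; I dropped the trailing -k because it is just the short form of --convert-links, which is already present):
$ wget -e robots=off --page-requisites --convert-links --recursive --level=0 --no-check-certificate --no-proxy -E -H -Dnoppa.aalto.fi https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset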
Upvotes: 1