user1387866

Reputation: 3004

Cannot get 'wget --recursive' to work

I would like to download this page:

https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset

as well as its subpages, especially the .pdf documents:

https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset/MS-A0210_thursday_30_oct.pdf
https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset/MS-A0210_hints_for_w45.pdf
etc.

When I give this command:

$ wget --page-requisites --convert-links --recursive --level=0 --no-check-certificate --no-proxy -E -H -Dnoppa.aalto.fi -k https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset

I get:

$ ls -R
.:
noppa.aalto.fi

./noppa.aalto.fi:
noppa  robots.txt

./noppa.aalto.fi/noppa:
kurssi

./noppa.aalto.fi/noppa/kurssi:
ms-a0210

./noppa.aalto.fi/noppa/kurssi/ms-a0210:
viikkoharjoitukset.html

I have tried several wget options, with no luck.

What could be the problem?

Upvotes: 1

Views: 156

Answers (1)

zb226

Reputation: 10519

By default, wget obeys the site's robots.txt file, which in this case disallows access for all user agents except Googlebot:

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
Disallow: /cgi-bin/
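
You can verify this yourself by fetching the file from the standard location at the site root and printing it to stdout, e.g.:

$ wget -qO- https://noppa.aalto.fi/robots.txt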

If you add -e robots=off to your command line, wget will ignore the robots.txt file.
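
For example, here is a sketch of your original command with robots.txt handling disabled (the redundant -k is dropped, since it is just the short form of --convert-links):

$ wget -e robots=off --page-requisites --convert-links --recursive --level=0 --no-check-certificate --no-proxy -E -H -Dnoppa.aalto.fi https://noppa.aalto.fi/noppa/kurssi/ms-a0210/viikkoharjoitukset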

Upvotes: 1
