Reputation: 6039
The following did not work.
wget -r -A .pdf home_page_url
It stops with the following message:
....
Removing site.com/index.html.tmp since it should be rejected.
FINISHED
I don't know why it stops at the starting URL and does not follow the links in it to search for the given file type.
Is there any other way to recursively download all PDF files from a website?
Upvotes: 8
Views: 15571
Reputation: 33
In my version of wget (GNU Wget 1.21.3), the -A/--accept and -r/--recursive flags don't play nicely with each other.
Here's my script for scraping a domain for PDFs (or any other filetype):
wget --no-verbose --mirror --spider https://example.com -o - | while read -r line
do
    [[ $line == *'200 OK' ]] || continue
    [[ $line == *'.pdf'* ]] || continue
    echo "$line" | cut -c25- | rev | cut -c7- | rev | xargs wget --no-verbose -P scraped-files
done
Explanation: Recursively crawl https://example.com and pipe the log output (containing all scraped URLs) to a while read block. When a line from the log output contains a PDF URL, strip the leading timestamp (the first 24 characters) and the trailing "200 OK" status (the last 6 characters), then use wget to download the PDF.
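For illustration, here is roughly the log line format the script assumes (a hypothetical sample line; the exact format can vary between wget versions and locales):
# Hypothetical --no-verbose --spider log line the script expects:
line='2024-01-15 10:30:45 URL: https://example.com/docs/report.pdf 200 OK'
# cut -c25- drops the first 24 characters (timestamp and "URL:" label),
# and rev | cut -c7- | rev drops the last 6 ("200 OK"); xargs ignores the leftover spaces.
echo "$line" | cut -c25- | rev | cut -c7- | rev
# prints " https://example.com/docs/report.pdf " (padded with a space on each side)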
Upvotes: 0
Reputation: 4338
This is almost certainly because the links in the HTML don't end with a /.
Wget will not follow such a link, as it thinks it's a file (one that doesn't match your filter):
<a href="link">page</a>
But will follow this:
<a href="link/">page</a>
You can use the --debug option to see if this is the actual problem.
I don't know of any good solution for this. In my opinion, this is a bug.
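If you want to confirm this, a minimal sketch (home_page_url is the placeholder from the question) that reruns the crawl with debug logging and filters for the rejection messages shown in the question:
# wget writes its log to stderr, hence the 2>&1; grep for the rejection/removal messages
wget -r -A .pdf --debug home_page_url 2>&1 | grep -iE 'reject|removing'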
Upvotes: 0
Reputation: 65
The following command works for me; it downloads the PDFs and images of a site:
wget -A pdf,jpg,png -m -p -E -k -K -np http://site/path/
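For reference, a PDF-only variant with the flags spelled out (the URL is a placeholder; the summaries follow GNU wget's documentation):
# -A pdf  accept only files ending in .pdf
# -m      mirror: recursive with infinite depth and timestamping
# -p      also fetch page requisites (images, CSS) for each page
# -E      adjust extensions, e.g. save HTML pages with an .html suffix
# -k      convert links in downloaded documents for local viewing
# -K      keep the original file (as .orig) before link conversion
# -np     never ascend to the parent directory
wget -A pdf -m -p -E -k -K -np http://site/path/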
Upvotes: 1
Reputation: 143
It may be blocked by robots.txt. Try adding -e robots=off.
Other possible problems are cookie-based authentication or user-agent rejection of wget. See these examples.
EDIT: The dot in ".pdf" is wrong according to sunsite.univie.ac.at
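If robots.txt or the user agent turns out to be the problem, a hedged sketch combining these suggestions (home_page_url and the agent string are placeholders):
# -e robots=off   ignore the robots.txt exclusions
# --user-agent    present a browser-like agent if the server rejects wget's default
# -A pdf          accept suffix without the leading dot, per the edit above
wget -r -l inf -A pdf -e robots=off --user-agent="Mozilla/5.0" home_page_url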
Upvotes: 1