Neil

Reputation: 6039

wget stops at the starting URL when downloading all files of a particular type from a website

The following command did not work:

wget -r -A .pdf home_page_url

It stops with the following message:

....
Removing site.com/index.html.tmp since it should be rejected.
FINISHED

I don't know why it stops at the starting URL and does not follow the links in it to search for the given file type.

Is there any other way to recursively download all PDF files on a website?

Upvotes: 8

Views: 15571

Answers (4)

alipman88

Reputation: 33

In my version of wget (GNU Wget 1.21.3), the -A/--accept and -r/--recursive flags don't play nicely with each other.

Here's my script for scraping a domain for PDFs (or any other filetype):


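wget --no-verbose --mirror --spider https://example.com -o - | while IFS= read -r line
do
  # Keep only successful responses whose log line mentions .pdf
  [[ $line == *'200 OK' ]] || continue
  [[ $line == *'.pdf'* ]] || continue
  # Strip the leading timestamp and trailing status, then download the URL
  echo "$line" | cut -c25- | rev | cut -c7- | rev | xargs wget --no-verbose -P scraped-files
done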

Explanation: Recursively crawl https://example.com and pipe the log output (containing all scraped URLs) to a while read loop. When a line from the log output contains a PDF URL, strip the leading timestamp (25 characters) and trailing request info (7 characters) and use wget to download the PDF.
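The fixed column offsets depend on the exact width of the log prefix. As a hedged alternative, the URL can be pulled out with parameter expansion instead. This is only a sketch: the function name extract_pdf_url is made up for illustration, and it assumes each log line looks like "<date> <time> URL: <url> 200 OK", which may differ between wget versions.

```shell
# Hypothetical helper: print the URL from one wget log line if the line
# reports "200 OK" and the URL ends in .pdf (optionally followed by a
# query string). Prints nothing for any other line.
# Assumed line shape: "<date> <time> URL: <url> 200 OK"
extract_pdf_url() {
  line=$1
  case $line in
    *'200 OK') ;;          # successful response, keep going
    *) return 0 ;;
  esac
  case $line in
    *URL:*) ;;             # line carries a URL marker
    *) return 0 ;;
  esac
  rest=${line#*URL:}       # drop everything up to and including "URL:"
  rest=${rest# }           # drop one optional leading space
  url=${rest%% *}          # keep the text up to the first space
  case $url in
    *.pdf|*.pdf\?*) printf '%s\n' "$url" ;;
  esac
}

# Possible use in the loop above (not run here):
#   wget --no-verbose --mirror --spider https://example.com -o - |
#     while IFS= read -r line; do extract_pdf_url "$line"; done |
#     xargs -n 1 wget --no-verbose -P scraped-files
```

This avoids counting characters, at the cost of relying on the "URL:" marker being present in the log line.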

Upvotes: 0

gagarine

Reputation: 4338

This is almost certainly because the links in the HTML don't end with a /.

Wget will not follow this, as it thinks it's a file (which doesn't match your filter):

<a href="link">page</a>

But will follow this:

<a href="link/">page</a>

You can use the --debug option to check whether this is the actual problem.

I don't know of any good solution for this. In my opinion, this is a bug.

Upvotes: 0

telehan

Reputation: 65

The following command works for me; it downloads the PDFs and pictures of a site:

wget -A pdf,jpg,png -m -p -E -k -K -np http://site/path/
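For readers unfamiliar with the short flags, here is the same command spelled out with wget's long options. It is wrapped in a function only so it can be reused; the name mirror_files is a hypothetical choice, and the URL is passed in as an argument rather than being part of the original answer.

```shell
# Hypothetical wrapper around the command above, using long option names.
mirror_files() {
  start_url=$1
  # --accept            -A   only keep files with these suffixes
  # --mirror            -m   recursion with timestamping, infinite depth
  # --page-requisites   -p   also fetch images/CSS needed by each page
  # --adjust-extension  -E   add .html etc. to saved files where needed
  # --convert-links     -k   rewrite links for local viewing
  # --backup-converted  -K   keep a .orig copy of each converted file
  # --no-parent         -np  never ascend above the starting directory
  wget --accept pdf,jpg,png \
       --mirror \
       --page-requisites \
       --adjust-extension \
       --convert-links \
       --backup-converted \
       --no-parent \
       "$start_url"
}

# Example call (not run here):
#   mirror_files http://site/path/
```

Note that the starting URL ends with a /, so --no-parent treats it as a directory and stays below it.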

Upvotes: 1

rimrul

Reputation: 143

It may be caused by the site's robots.txt. Try adding -e robots=off.

Other possible problems are cookie based authentication or agent rejection for wget. See these examples.

EDIT: The dot in ".pdf" is wrong according to sunsite.univie.ac.at
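A minimal sketch combining these suggestions, assuming the site only needs robots.txt ignored and a browser-like user agent. The function name fetch_pdfs, the agent string, and the cookies.txt file are placeholders, not values from the question.

```shell
# Hypothetical helper: recursive PDF download that ignores robots.txt.
fetch_pdfs() {
  start_url=$1
  # -e robots=off   do not honour robots.txt exclusions
  # --user-agent    send a browser-like agent in case "Wget" is rejected
  # --accept pdf    suffix written without the leading dot
  wget --recursive --no-parent \
       --accept pdf \
       -e robots=off \
       --user-agent='Mozilla/5.0' \
       "$start_url"
  # If the site uses cookie-based authentication, a cookie file exported
  # from a browser could be added with: --load-cookies cookies.txt
}

# Example call (not run here):
#   fetch_pdfs home_page_url
```

Ignoring robots.txt bypasses the site owner's stated crawling preferences, so it is worth checking that this is acceptable for the site in question.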

Upvotes: 1
