Reputation: 51
I'm trying to use wget to elegantly & politely download all the pdfs from a website. The pdfs live in various sub-directories under the starting URL. It appears that the -A pdf option is conflicting with the -r option. But I'm not a wget expert! This command:
wget -nd -np -r site/path
faithfully traverses the entire site downloading everything downstream of path (not polite!). This command:
wget -nd -np -r -A pdf site/path
finishes immediately having downloaded nothing. Running that same command in debug mode:
wget -nd -np -r -A pdf -d site/path
reveals that the sub-directories are ignored with the debug message:
Deciding whether to enqueue "https://site/path/subdir1". https://site/path/subdir1 (subdir1) does not match acc/rej rules. Decided NOT to load it.
I think this means that the sub directories did not satisfy the "pdf" filter and were excluded. Is there a way to get wget to recurse into sub directories (of random depth) and only download pdfs (into a single local dir)? Or does wget need to download everything and then I need to manually filter for pdfs afterward?
UPDATE: thanks to everyone for their ideas. The solution was to use a two step approach including a modified version of this: http://mindspill.net/computing/linux-notes/generate-list-of-urls-using-wget/
Upvotes: 1
Views: 556
Reputation: 51
UPDATE: thanks to everyone for their ideas. The solution was to use a two step approach including a modified version of this: http://mindspill.net/computing/linux-notes/generate-list-of-urls-using-wget/
Upvotes: 2
Reputation: 11624
Try this
1)the “-l” switch specifies to wget to go one level down from the primary URL specified. You could obviously switch that to how ever many levels down in the links you want to follow.
wget -r -l1 -A.pdf http://www.example.com/page-with-pdfs.htm
refer man wget
for more details
if the above doesn't work,try this
verify that the TOS of the web site permit to crawl it. Then, one solution is :
mech-dump --links 'http://example.com' | grep pdf$ | sed 's/\s+/%20/g' | xargs -I% wget http://example.com/% The mech-dump command comes with Perl's module WWW::Mechanize (libwww-mechanize-perl package on debian & debian likes distros
for installing mech-dump
sudo apt-get update -y
sudo apt-get install -y libwww-mechanize-shell-perl
github repo https://github.com/libwww-perl/WWW-Mechanize
You will need to have wget and lynx installed:
sudo apt-get install wget lynx
Prepare a script name it however you want for this example pdflinkextractor
#!/bin/bash
WEBSITE="$1"
echo "Getting link list..."
lynx -cache=0 -dump -listonly "$WEBSITE" | grep ".*\.pdf$" | awk '{print $2}' | tee pdflinks.txt
echo "Downloading..."
wget -P pdflinkextractor_files/ -i pdflinks.txt
to run the file
chmod 700 pdfextractor
$ ./pdflinkextractor http://www.pdfscripting.com/public/Free-Sample-PDF-Files-with-scripts.cfm
Upvotes: 1