sicsmpr

Reputation: 51

wget recursion and file extraction

I'm trying to use wget to elegantly & politely download all the pdfs from a website. The pdfs live in various sub-directories under the starting URL. It appears that the -A pdf option is conflicting with the -r option. But I'm not a wget expert! This command:

wget -nd -np -r site/path

faithfully traverses the entire site downloading everything downstream of path (not polite!). This command:

wget -nd -np -r -A pdf site/path

finishes immediately having downloaded nothing. Running that same command in debug mode:

wget -nd -np -r -A pdf -d site/path

reveals that the sub-directories are ignored with the debug message:

Deciding whether to enqueue "https://site/path/subdir1". https://site/path/subdir1 (subdir1) does not match acc/rej rules. Decided NOT to load it.

I think this means that the sub-directories did not satisfy the "pdf" filter and were excluded. Is there a way to get wget to recurse into sub-directories (of arbitrary depth) and only download pdfs (into a single local directory)? Or does wget need to download everything, leaving me to filter for pdfs afterward?

UPDATE: thanks to everyone for their ideas. The solution was to use a two-step approach including a modified version of this: http://mindspill.net/computing/linux-notes/generate-list-of-urls-using-wget/

Upvotes: 1

Views: 556

Answers (2)

sicsmpr

Reputation: 51

UPDATE: thanks to everyone for their ideas. The solution was to use a two-step approach including a modified version of this: http://mindspill.net/computing/linux-notes/generate-list-of-urls-using-wget/
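
For reference, this is roughly what that two-step approach looks like (a sketch only: the site path, log file name, and the grep pattern are placeholders to adapt):

    # Step 1: crawl the site in spider mode (nothing is saved locally) and log every URL wget finds.
    wget --spider -r -np -nv -o wget-crawl.log https://site/path/

    # Step 2: pull the PDF URLs out of the log, de-duplicate them, and download them into one flat directory.
    grep -Eo 'https?://[^ ]*\.pdf' wget-crawl.log | sort -u > pdf-urls.txt
    wget -nd -P pdfs/ -i pdf-urls.txt

Because the spider pass is not filtered by -A, it can follow the sub-directories; the PDF filtering happens afterward on the logged URLs.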

Upvotes: 2

Jatin Mehrotra

Reputation: 11624

Try this:

1) The -l switch tells wget to go one level down from the primary URL you specify. You can obviously change that to however many levels of links you want to follow.

wget -r -l1 -A.pdf http://www.example.com/page-with-pdfs.htm

Refer to man wget for more details.
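
If the goal is a single flat local directory, the -nd option from the question can be added to the same command (an untested sketch, reusing the example URL from above):

    wget -r -l1 -nd -A.pdf http://www.example.com/page-with-pdfs.htm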

2) If the above doesn't work, try this:

    Verify that the site's terms of service permit crawling it. Then one solution is:

    mech-dump --links 'http://example.com' | grep 'pdf$' | sed 's/\s\+/%20/g' | xargs -I% wget http://example.com/%

    The mech-dump command comes with Perl's WWW::Mechanize module (the libwww-mechanize-perl package on Debian and Debian-like distros). A two-step variant is sketched below, after the install notes.

To install mech-dump:

sudo apt-get update -y
sudo apt-get install -y libwww-mechanize-shell-perl

GitHub repo: https://github.com/libwww-perl/WWW-Mechanize
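
A two-step variant of the same idea, useful if you want to review the link list before downloading (a sketch; example.com is a placeholder and, like the one-liner above, it assumes the extracted links are relative to the site root):

    # Save the extracted PDF links to a file so they can be inspected first.
    mech-dump --links 'http://example.com' | grep 'pdf$' > pdf-links.txt

    # Prepend the base URL and download everything into a single local directory.
    sed 's|^|http://example.com/|' pdf-links.txt | wget -nd -P pdfs/ -i -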

3) I haven't tested this, but you can still give it a try. I think you still need to find a way to get all the URLs of a website and pipe them into one of the solutions above.

You will need to have wget and lynx installed:

sudo apt-get install wget lynx

Prepare a script; name it however you want. For this example, call it pdflinkextractor:

    #!/bin/bash

    WEBSITE="$1"

    echo "Getting link list..."

    # Dump the page's links with lynx, keep only those ending in .pdf,
    # and save the second column (the URL) to pdflinks.txt
    lynx -cache=0 -dump -listonly "$WEBSITE" | grep ".*\.pdf$" | awk '{print $2}' | tee pdflinks.txt

    echo "Downloading..."
    wget -P pdflinkextractor_files/ -i pdflinks.txt

To run the script:

chmod 700 pdflinkextractor
./pdflinkextractor http://www.pdfscripting.com/public/Free-Sample-PDF-Files-with-scripts.cfm

Upvotes: 1
