Jim

Reputation: 53

Wget spider a website to collect all links

I'm trying to spider this website to depth=2 and collect all the links (URLs). It seems like a simple task, but it appears to be impossible and I must be missing something: I get no URLs, just an empty text file. Here is the latest command I'm using (messy, I know):

wget --spider --force-html --span-hosts --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0" -np --limit-rate=20k -e robots=off --wait=3 --random-wait -r -l2 https://en.wikibooks.org/wiki/C%2B%2B_Programming 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '.(css\|js\|png\|gif\|jpg)$' | sort | uniq > urls.txt

Any ideas?

Upvotes: 4

Views: 5988

Answers (1)

Prateek Paranjpe

Reputation: 543

I would suggest doing it in two steps, for better readability and less clutter:

  1. Do the spidering and capture the output in a log file.
  2. Parse the log file to extract the URLs you are looking for.

For #1 -

wget --spider --force-html --span-hosts --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0" -np --limit-rate=20k -e robots=off --wait=3 --random-wait -r -l2 https://en.wikibooks.org/wiki/C%2B%2B_Programming -o wget.log &

Once #1 is done, you can go for #2.
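Because the command in #1 is backgrounded with &, make sure the crawl has actually finished before parsing the log. A minimal sketch, assuming both steps run from the same shell session:

# block until the backgrounded wget job finishes
wait
# optionally check the end of the log to confirm the crawl completed
tail -n 5 wget.log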

For #2 -

grep http wget.log | grep -vF '[following]' | awk '{print $3}' | grep -vE "\.css|\.js|\.png|\.gif|\.jpg" | sort -u > urls.txt

This will give you what you are looking for.
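To sanity-check the result, you can count and preview the collected URLs in urls.txt, for example:

# count the collected URLs
wc -l urls.txt
# preview the first few entries
head urls.txt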

Note: #1 will fetch everything it finds, and since you are going 2 levels deep, that might be a whole lot of data. You could add the "--delete-after" option to wget if you don't want to keep any of it on disk (e.g., if you only plan to use urls.txt to download things later), as sketched below.
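A minimal sketch of the #1 command with --delete-after simply added to the existing flags (same options and log file name as above; adjust to taste):

wget --spider --delete-after --force-html --span-hosts --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0" -np --limit-rate=20k -e robots=off --wait=3 --random-wait -r -l2 https://en.wikibooks.org/wiki/C%2B%2B_Programming -o wget.log &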

Upvotes: 6
