GobiasKoffi
GobiasKoffi

Reputation: 4084

Crawl website using wget and limit total number of crawled links

I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website, and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?

wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"

Upvotes: 3

Views: 5932

Answers (2)

Olivier Delouya
Olivier Delouya

Reputation: 1

  1. Create a fifo file (mknod /tmp/httpipe p)
  2. do a fork
    • in the child do wget --spider -r -l 1 http://myurl --output-file /tmp/httppipe
    • in the father: read line by line /tmp/httpipe
    • parse the output =~ m{^\-\-\d\d:\d\d:\d\d\-\- http://$self->{http_server}:$self->{tcport}/(.*)$}, print $1
    • count the lines; after 100 lines just close the file, it will break the pipe

Upvotes: 0

Wolph
Wolph

Reputation: 80031

You can't. wget doesn't support this so if you want something like this, you would have to write a tool yourself.

You could fetch the main file, parse the links manually, and fetch them one by one with a limit of 100 items. But it's not something that wget supports.

You could take a look at HTTrack for website crawling too, it has quite a few extra options for this: http://www.httrack.com/

Upvotes: 2

Related Questions