Crawl website using wget and limit total number of crawled links

Question

I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website, and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?

wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"

Wolph · Accepted Answer

You can't. wget doesn't support this so if you want something like this, you would have to write a tool yourself.

You could fetch the main file, parse the links manually, and fetch them one by one with a limit of 100 items. But it's not something that wget supports.

You could take a look at HTTrack for website crawling too, it has quite a few extra options for this: http://www.httrack.com/

Crawl website using wget and limit total number of crawled links

Answers (2)

Related Questions