Reputation: 1636
I've been crawling a website over the last two weeks. I used the crawl command with 100 iterations. The process has just finished. How can I know the coverage of the data crawled? I don't expect an exact number, but I'd really like to know approximately how much information remains un-crawled on the website.
Upvotes: 1
Views: 326
Reputation: 1636
Thanks, @Jorge. Based on what you've said:
Nutch has no idea how big or small the website(s) you're crawling are
So there's no way to calculate that unless you know the size of the website in advance.
Thanks again.
Upvotes: 0
Reputation: 3253
Your question is a bit ambiguous. If you're trying to find out how much of the entire website you've already crawled, that is a hard problem: Nutch has no idea how big or small the website(s) you're crawling are. You said you've done 100 iterations; with the default settings in the bin/crawl script, this means that on each iteration Nutch fetches a maximum of 50,000 URLs (https://github.com/apache/nutch/blob/master/src/bin/crawl#L117). But this doesn't mean your website doesn't have more URLs, only that this is how Nutch is configured, and Nutch may not even have discovered all of the URLs yet. On each iteration Nutch can discover new URLs, making the process incremental.
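To give an intuition of where that per-iteration cap comes from: each iteration runs a generate/fetch/parse/updatedb cycle, and the generate step's fetch list is limited with a -topN value. A rough manual equivalent of one iteration's generate step might look like the following (the paths and the 50000 figure are assumptions based on the default layout; check your own bin/crawl version for the exact value it uses):
# Generate the next fetch list from the crawldb, capped at 50,000 URLs
# (roughly what bin/crawl does once per iteration).
$ bin/nutch generate crawl/crawldb crawl/segments -topN 50000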
What you can do is execute the bin/nutch readdb command passing the -stats parameter, something like:
$ bin/nutch readdb crawl/crawldb -stats
This should produce output similar to:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 575
retry 0: 569
retry 1: 6
min score: 0.0
avg score: 0.0069252173
max score: 1.049
status 1 (db_unfetched): 391
status 2 (db_fetched): 129
status 3 (db_gone): 53
status 4 (db_redir_temp): 1
status 5 (db_redir_perm): 1
CrawlDb statistics: done
With this info you can see the total number of URLs discovered and how many of them have been fetched, along with some other useful information.
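If you just want a quick fetched-vs-known ratio as a rough coverage figure (keeping in mind it only counts URLs Nutch has discovered so far), you could post-process the stats output with something like this sketch (assuming the crawldb lives at crawl/crawldb as in the example above):
$ bin/nutch readdb crawl/crawldb -stats | awk '
    /TOTAL urls:/ { total = $NF }     # total URLs known to the crawldb
    /db_fetched/  { fetched = $NF }   # URLs successfully fetched
    END { if (total > 0) printf "fetched %d of %d known URLs (%.1f%%)\n", fetched, total, 100 * fetched / total }'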
Upvotes: 2