Reputation: 8670
I have two URLs in my urls/seed file. My crawler is taking too long before it starts fetching. My already crawled data is about 220 GB. Any idea why Nutch is behaving like this?
Upvotes: 1
Views: 356
Reputation: 723
Before the fetch job, Nutch runs the generate job. In the generate job, Nutch selects the topN URLs with the highest scores among all URLs in the CrawlDB for fetching. With about 220 GB already crawled, your CrawlDB probably holds far more URLs than the two in your seed file. So the likely reason your crawler takes so long before fetching is that topN is set too high for your system's capacity and the number of URLs in the CrawlDB is large, so the selection step takes time.
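To make the cost of that selection step concrete, here is a minimal Python sketch of the idea. It is only an illustration and not Nutch's actual code (Nutch is written in Java and runs generation as a Hadoop job); the names `crawl_db` and `select_top_n` and the example URLs and scores are made up for this example.

```python
import heapq

def select_top_n(crawl_db, top_n):
    """Pick the top_n highest-scoring URLs.

    Every entry in crawl_db has to be examined, so the work grows
    with the size of the whole database, not with the seed list.
    """
    best = heapq.nlargest(top_n, crawl_db.items(), key=lambda kv: kv[1])
    return [url for url, _score in best]

# Hypothetical CrawlDB: URL -> score. A real one may hold millions of entries.
crawl_db = {
    "http://example.com/a": 0.9,
    "http://example.com/b": 0.2,
    "http://example.com/c": 0.7,
    "http://example.com/d": 0.4,
}

selected = select_top_n(crawl_db, 2)
print(selected)
```

In this sketch the two seed URLs play no special role: the selection still has to look at every known URL. If that matches your situation, lowering the `-topN` value you pass to the generate step should reduce the work done before fetching starts.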
Hope this helps
Le Quoc Do
Upvotes: 1