Hafiz Muhammad Shafiq

Reputation: 8670

Apache Nutch taking too long in the generate phase

I have two URLs in my urls/seed file. My crawler takes too long before it starts fetching. My already-crawled data is about 220 GB. Any idea why Nutch is behaving like this?

Upvotes: 1

Views: 356

Answers (1)

Do Do

Reputation: 723

Before the fetch job, Nutch runs the generate job. In the generate job, Nutch selects the topN URLs with the highest scores among all URLs in CrawlDB for fetching. So the likely reasons your crawler takes so long before fetching are that topN is set too high for your system's capacity, and that the number of URLs in CrawlDB is large (the selection process itself takes time).
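To illustrate why the size of CrawlDB matters even with only two seed URLs, here is a minimal Python sketch of a "pick the topN highest-scoring URLs" step. It is not Nutch's actual implementation (Nutch runs this as a Hadoop job); the `select_top_n` function and the in-memory `crawldb` dictionary are hypothetical stand-ins. The point is only that every entry has to be looked at once, so the work grows with the number of URLs already stored.

```python
import heapq


def select_top_n(crawldb, top_n):
    """Return the top_n (url, score) pairs with the highest scores.

    Every entry in crawldb is visited once, so the cost grows with the
    size of crawldb even when top_n is small.
    """
    return heapq.nlargest(top_n, crawldb.items(), key=lambda item: item[1])


# Hypothetical in-memory stand-in for CrawlDB: URL -> score.
crawldb = {
    "http://example.com/a": 0.2,
    "http://example.com/b": 0.9,
    "http://example.com/c": 0.5,
    "http://example.com/d": 0.7,
}

top = select_top_n(crawldb, 2)
print(top)
```

With a CrawlDB built from about 220 GB of crawled data, that full pass is what runs before fetching starts, so lowering topN alone may not remove the delay.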

Hope this helps

Le Quoc Do

Upvotes: 1
