Apache Nutch taking too long in the generate phase

Question

I have two URLs in my urls/seed file. My crawler is taking too long before it starts fetching. My already crawled data is about 220 GB. Any idea why Nutch is behaving like this?

Do Do · Accepted Answer

Before the fetch job, Nutch runs the generate job. In the generate job, Nutch selects the topN URLs with the highest scores among all URLs in the CrawlDB and schedules them for fetching. So the likely reason your crawler takes so long before fetching is that topN is set too high for your system's capacity and the number of URLs in the CrawlDB is large, so the selection process itself takes time.
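
To illustrate why the size of the seed file does not matter here, below is a minimal Python sketch of a top-N selection by score. It is not Nutch's actual code (Nutch runs the generate step as a Hadoop job); the function name `select_top_n` and the sample entries are hypothetical. The point it shows is that every entry in the CrawlDB has to be read to pick the highest-scoring ones, so the cost grows with the CrawlDB size rather than with the number of seeds.

```python
import heapq


def select_top_n(crawldb, top_n):
    """Return the top_n (url, score) pairs with the highest scores.

    crawldb is any iterable of (url, score) pairs. The whole iterable
    is scanned once, which is why a large CrawlDB makes this slow.
    """
    return heapq.nlargest(top_n, crawldb, key=lambda entry: entry[1])


# Hypothetical sample data standing in for CrawlDB entries.
sample_crawldb = [
    ("http://example.com/a", 0.2),
    ("http://example.com/b", 1.5),
    ("http://example.com/c", 0.9),
    ("http://example.com/d", 0.1),
]

selected = select_top_n(sample_crawldb, 2)
print(selected)
```

With a CrawlDB built from roughly 220 GB of crawled data, this scan is probably where the time goes, so lowering the value passed as `-topN` to the generate step is a reasonable first thing to try.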

Hope this helps

Le Quoc Do
