Reputation: 1715
In a crawl cycle, we have many tasks/phases like inject, generate, fetch, parse, updatedb, invertlinks, dedup, and an index job. I would like to know whether there is any way to get the status of a crawl task (whether it is running or has failed) other than reading the hadoop.log file. To be more precise, can I track the status of a generate/fetch/parse phase? Any help would be appreciated.
Upvotes: 2
Views: 283
Reputation: 4854
You should always run Nutch with Hadoop in pseudo- or fully distributed mode. This way you'll be able to use the Hadoop UI to track the progress of your crawls, see the logs for each step, and access the counters (extremely useful!).
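Besides the web UI, the YARN ResourceManager exposes a REST endpoint (`/ws/v1/cluster/apps`) listing applications and their states, which you could poll to check whether a generate/fetch/parse job is still running or has failed. A minimal sketch of parsing that response, assuming a YARN-based cluster (the `localhost:8088` address is the default unsecured ResourceManager port and may differ on your setup):

```python
import json

# Assumed ResourceManager address; adjust for your cluster.
RM_APPS_URL = "http://localhost:8088/ws/v1/cluster/apps"

def summarize_apps(apps_json: str):
    """Extract (name, state, finalStatus) for each application from the
    JSON returned by the ResourceManager's /ws/v1/cluster/apps endpoint."""
    data = json.loads(apps_json)
    apps = (data.get("apps") or {}).get("app") or []
    return [(a["name"], a["state"], a["finalStatus"]) for a in apps]

# Abridged example of the response shape; a real response would come from
# fetching RM_APPS_URL (e.g. with urllib.request.urlopen).
sample = """
{"apps": {"app": [
  {"id": "application_1_0001", "name": "generate: crawl/segments",
   "state": "FINISHED", "finalStatus": "SUCCEEDED"},
  {"id": "application_1_0002", "name": "fetch crawl/segments/20240101",
   "state": "RUNNING", "finalStatus": "UNDEFINED"}
]}}
"""

for name, state, final in summarize_apps(sample):
    print(f"{name}: {state} ({final})")
```

Nutch names its MapReduce jobs after the phase being run, so filtering on the application name lets you single out generate/fetch/parse jobs specifically.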
Upvotes: 3