Sachin

Reputation: 1715

How to get job status of crawl tasks in nutch

In a crawl cycle, we have many tasks/phases like inject, generate, fetch, parse, updatedb, invertlinks, dedup, and an index job. Is there any way to get the status of a crawl task (whether it is running or has failed) other than by reading the hadoop.log file? More precisely, can I track the status of the generate/fetch/parse phases? Any help would be appreciated.

Upvotes: 2

Views: 283

Answers (1)

Julien Nioche

Reputation: 4854

You should always run Nutch with Hadoop in pseudo- or fully distributed mode. That way you can use the Hadoop UI to track the progress of your crawls, see the logs for each step, and access the counters (extremely useful!).
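Since Nutch typically names each MapReduce job after the phase that launched it (inject, generate, fetch, etc.), you can also poll the ResourceManager REST API instead of clicking through the web UI. A minimal sketch, assuming a YARN cluster with the ResourceManager on its default port (the `localhost:8088` address is a placeholder; adjust for your cluster):

```python
import json
import urllib.request

# Placeholder ResourceManager address -- change to match your cluster.
RM_URL = "http://localhost:8088"

def fetch_apps(rm_url=RM_URL):
    """Fetch the application list from the YARN ResourceManager REST API."""
    with urllib.request.urlopen(rm_url + "/ws/v1/cluster/apps") as resp:
        return json.load(resp)

def summarize(apps_json, name_filter=""):
    """Return (name, state, finalStatus) tuples for applications whose
    name contains name_filter (case-insensitive)."""
    apps = (apps_json.get("apps") or {}).get("app") or []
    return [(a["name"], a["state"], a["finalStatus"])
            for a in apps
            if name_filter.lower() in a["name"].lower()]

# Example usage (requires a running cluster):
#   summarize(fetch_apps(), "fetch")  -> status of the fetch phase jobs
```

`state` tells you whether a job is still RUNNING, and `finalStatus` distinguishes SUCCEEDED from FAILED once it finishes, which answers the question without touching hadoop.log.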

Upvotes: 3
