Reputation: 1715
In a crawl cycle, we have many tasks/phases like inject, generate, fetch, parse, updatedb, invertlinks, dedup, and an index job. I would like to know whether there is any way to get the status of a crawl task (whether it is running or has failed) other than reading the hadoop.log file. To be more precise, can I track the status of a generate/fetch/parse phase? Any help would be appreciated.
Upvotes: 2
Views: 283
Reputation: 4854
You should always run Nutch with Hadoop in pseudo- or fully distributed mode. This way you'll be able to use the Hadoop UI to track the progress of your crawls, see the logs for each step, and access the counters (extremely useful!).
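Besides the web UI, the YARN ResourceManager exposes a REST endpoint (`/ws/v1/cluster/apps`) listing applications and their states, which you could poll to check whether a generate/fetch/parse job is still running or has failed. A minimal sketch of parsing that response, assuming a YARN-based cluster (the `localhost:8088` address is the default unsecured ResourceManager port and may differ on your setup):

```python
import json

# Assumed ResourceManager address; adjust for your cluster.
RM_APPS_URL = "http://localhost:8088/ws/v1/cluster/apps"

def summarize_apps(apps_json: str):
    """Extract (name, state, finalStatus) for each application from the
    JSON returned by the ResourceManager's /ws/v1/cluster/apps endpoint."""
    data = json.loads(apps_json)
    apps = (data.get("apps") or {}).get("app") or []
    return [(a["name"], a["state"], a["finalStatus"]) for a in apps]

# Abridged example of the response shape; a real response would come from
# fetching RM_APPS_URL (e.g. with urllib.request.urlopen).
sample = """
{"apps": {"app": [
  {"id": "application_1_0001", "name": "generate: crawl/segments",
   "state": "FINISHED", "finalStatus": "SUCCEEDED"},
  {"id": "application_1_0002", "name": "fetch crawl/segments/20240101",
   "state": "RUNNING", "finalStatus": "UNDEFINED"}
]}}
"""

for name, state, final in summarize_apps(sample):
    print(f"{name}: {state} ({final})")
```

Nutch names its MapReduce jobs after the phase being run, so filtering on the application name lets you single out generate/fetch/parse jobs specifically.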
Upvotes: 3