Dumping Nutch Crawldb

Question

How can I get a dump of all the URLs in the Nutch crawldb that have status 3 (db_gone)? I am using Nutch 1.4.

I looked at the wiki, but it is unclear how to do this.

BitByter GS · Accepted Answer

CrawlDbReader in Nutch 1.4 cannot generate a crawldb dump filtered by document status. In Nutch 1.5 and later, you can specify a status when reading the crawldb, and readdb will dump only the documents with that status.

[root@srchengn nutch]# bin/nutch readdb <crawldb> -dump <out_dir> -status db_gone

If you want to do the same in Nutch 1.4, you have to modify the org.apache.nutch.crawl.CrawlDbReader class.
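
A workaround that avoids patching the class might be to take a full dump on 1.4 and filter it afterwards. The sketch below assumes the plain-text dump layout (records separated by a blank line, the URL as the first tab-separated field of the first line, and a line reading "Status: 3 (db_gone)"); check a few records of your own dump before relying on it. The function name gone_urls and the paths in the usage comment are made up for illustration.

```shell
# Print the URL of every dump record whose status line is "Status: 3 (db_gone)".
# Assumption: records are separated by a blank line and the first line of each
# record starts with the URL followed by a tab.
gone_urls() {
  awk 'BEGIN { RS = ""; FS = "\n" }      # paragraph mode: one record per block
       /Status: 3 \(db_gone\)/ {
         split($1, head, "\t")           # first line: URL<TAB>Version: ...
         print head[1]
       }' "$@"
}

# Hypothetical usage (crawl/crawldb and dump_out are placeholder paths):
#   bin/nutch readdb crawl/crawldb -dump dump_out
#   gone_urls dump_out/part-00000 > gone_urls.txt
```

This only lists the URLs. Printing `$0` instead of `head[1]` would keep each matching record whole.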
