Reputation: 1491
How can I get a dump of the Nutch crawldb containing all the URLs with status 3 (db_gone)? The version of Nutch I am using is 1.4.
I looked at the wiki, but it is unclear how to do this.
Upvotes: 1
Views: 946
Reputation: 999
CrawlDbReader in Nutch 1.4 doesn't generate a dump of the crawldb based on a document's status. In Nutch 1.5 and later you can specify a status when reading the crawldb, and readdb will dump only the documents with that status:
[root@srchengn nutch]# bin/nutch readdb <path_crawldb> -dump <output_directory> -status db_gone
If you want to do the same in Nutch 1.4, you have to modify the org.apache.nutch.crawl.CrawlDbReader class.
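A possible workaround on 1.4 that avoids patching the class is to dump the whole crawldb and filter the text output afterwards. The sketch below assumes the plain-text dump writes each record as a `<url><TAB>Version: N` line followed by a `Status: <code> (<name>)` line; check a few records of your own dump before relying on it. `urls_with_status` is a hypothetical helper name, not part of Nutch.

```shell
# Print the URL of every dump record whose status name matches $1.
# Assumption: each record starts with "<url>\tVersion: N" and the next
# line reads "Status: <code> (<name>)", e.g. "Status: 3 (db_gone)".
urls_with_status() {
  awk -v want="($1)" -F '\t' '
    /\tVersion: / { url = $1; next }                   # remember the URL of the current record
    /^Status: /   { if (index($0, want) > 0) print url }  # literal match on "(name)"
  '
}
```

After a full dump with `bin/nutch readdb <path_crawldb> -dump <output_directory>`, the part files in the output directory (typically named like `part-00000`) could be piped through it, for example `cat <output_directory>/part-* | urls_with_status db_gone`.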
Upvotes: 2