Allan Macmillan

Reputation: 1491

Dumping Nutch Crawldb

How can I get a dump of all the URLs in the Nutch crawldb with status 3 (db_gone)? I am using Nutch 1.4.

I looked at the wiki, but it is unclear on how to do this.

Upvotes: 1

Views: 946

Answers (1)

BitByter GS

Reputation: 999

CrawlDbReader in Nutch 1.4 cannot filter a crawldb dump by document status. In Nutch 1.5 and later, you can pass a status to readdb, and it will dump only the documents with that status:

[root@srchengn nutch]# bin/nutch readdb <path_crawldb> -dump <output_directory> -status db_gone

If you want to do the same in Nutch 1.4, you have to modify the org.apache.nutch.crawl.CrawlDbReader class.
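If patching the class is not an option, a different approach (not the one described above) is to take a full dump in 1.4 with `bin/nutch readdb <path_crawldb> -dump <output_directory>` and filter the resulting text afterwards. The sketch below is only a rough illustration: the function name `urls_with_status` is made up, and it assumes each record in the dump starts with a `<url><TAB>Version: <n>` line followed by a `Status: 3 (db_gone)` style line. Check this against your own dump output before relying on it.

```python
import re
from typing import Iterable, Iterator

# Assumed record header in the text dump: "<url>\tVersion: <n>"
_HEADER = re.compile(r"^(\S+)\tVersion: \d+\s*$")
# Assumed status line in the text dump: "Status: 3 (db_gone)"
_STATUS = re.compile(r"^Status: \d+ \((\w+)\)\s*$")


def urls_with_status(lines: Iterable[str], status: str = "db_gone") -> Iterator[str]:
    """Yield the URL of every dumped record whose status name equals `status`."""
    current_url = None
    for line in lines:
        line = line.rstrip("\n")
        header = _HEADER.match(line)
        if header:
            # Start of a new record: remember its URL until the status line.
            current_url = header.group(1)
            continue
        found = _STATUS.match(line)
        if found and current_url is not None:
            if found.group(1) == status:
                yield current_url
            # Each record has one status line; wait for the next header.
            current_url = None


# Tiny hand-written sample in the assumed format (not real crawl data).
sample = [
    "http://a.example/\tVersion: 7",
    "Status: 3 (db_gone)",
    "Retries since fetch: 0",
    "",
    "http://b.example/\tVersion: 7",
    "Status: 2 (db_fetched)",
]

for url in urls_with_status(sample):
    print(url)
```

To use it on a real dump, pass an open file handle for a dump file from the output directory (typically named something like `part-00000`) instead of `sample`. Because the function reads line by line, it does not need to hold the whole dump in memory.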

Upvotes: 2
