Reputation: 8680
I have set up Nutch 1.17 to crawl some data. After downloading, I need to export that data to JSON. It should contain the parsed text, title, timestamp, and URL. How can I do this?
Upvotes: 0
Views: 305
Reputation: 2239
Alternatively, indexer-csv could be used as a first step (converting the CSV to JSON would be the second step). Indexer-csv lets you configure which Nutch index fields to export: title, URL ("id"), timestamp ("tstamp") and parsed text ("content") are provided as standard fields or via the plugin "index-basic".
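For the second step, a minimal Python sketch of the CSV-to-JSON conversion could look like the following. It assumes the CSV file starts with a header row naming the columns "id", "title", "tstamp" and "content"; the function names, the output key names ("url", "timestamp", "text") and the file paths are all hypothetical and only there for illustration, so adjust them to whatever your indexer-csv configuration actually writes.

```python
import csv
import io
import json

# Assumed mapping from Nutch index field names to the desired JSON keys.
# The right-hand names are illustrative, not anything Nutch defines.
FIELD_MAP = {
    "id": "url",
    "title": "title",
    "tstamp": "timestamp",
    "content": "text",
}


def csv_to_json_records(csv_text, field_map=FIELD_MAP):
    """Turn CSV text (with a header row) into a list of dicts.

    Only the columns listed in field_map are kept; missing values
    become empty strings.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append(
            {new: (row.get(old) or "") for old, new in field_map.items()}
        )
    return records


def convert_file(csv_path, json_path):
    """Read a CSV file and write the records as a JSON array."""
    with open(csv_path, newline="", encoding="utf-8") as src:
        records = csv_to_json_records(src.read())
    with open(json_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst, ensure_ascii=False, indent=2)
```

Calling something like `convert_file("nutch.csv", "nutch.json")` (both paths are placeholders) would then write one JSON object per crawled page. If your CSV uses a different separator or quote character, pass the matching `delimiter` / `quotechar` arguments to `csv.DictReader`.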
Upvotes: 1
Reputation: 3253
You can take a look at PR #490, which closed issue NUTCH-1863. It lets you dump the CrawlDB in JSON format (check the -format flag).
One potential drawback is that this tool probably will not output the exact format that you want/need (different field names), but it should be a good starting point (and it should contain more data than what you need).
Ultimately, you could implement a custom class to dump the content of a segment in your desired format. You could use the SegmentDump.java class as a base implementation.
Upvotes: 2