Hafiz Muhammad Shafiq

Reputation: 8680

Apache Nutch 1.17, Dump parsed content with some metadata into JSON

I have set up Nutch 1.17 to crawl some data. After downloading, I need to export that data to JSON. The output should contain the parsed text, title, timestamp and URL. How can I do this?

Upvotes: 0

Views: 305

Answers (2)

Sebastian Nagel

Reputation: 2239

Alternatively, indexer-csv could be used as a first step, with the conversion from CSV to JSON as the second step. Indexer-csv lets you configure which Nutch index fields to export: title, URL ("id"), timestamp ("tstamp") and parsed text ("content") are provided as standard fields or via the plugin "index-basic".
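For the second step, a minimal sketch of the CSV to JSON conversion in Python could look like the following. It assumes the CSV starts with a header row naming the fields "id", "title", "tstamp" and "content", and uses a comma separator with double-quote quoting; check your indexer-csv configuration, since these are assumptions. The function name `csv_to_json_records`, the output key names in `FIELD_MAP` and the sample row are made up for illustration.

```python
import csv
import io
import json

# Hypothetical mapping from Nutch index field names to the JSON keys you want.
FIELD_MAP = {
    "id": "url",
    "title": "title",
    "tstamp": "timestamp",
    "content": "text",
}


def csv_to_json_records(csv_text, field_map):
    """Turn CSV text with a header row into a list of dicts with renamed keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Keep only the mapped fields; a missing column becomes an empty string.
        records.append({out: row.get(src, "") for src, out in field_map.items()})
    return records


# Made-up sample standing in for the file written by indexer-csv.
sample = (
    "id,title,tstamp,content\n"
    '"https://example.org/","Example","2020-07-01T12:00:00Z","Hello, world"\n'
)

records = csv_to_json_records(sample, FIELD_MAP)
print(json.dumps(records, indent=2))
```

For a real export, read the CSV file with `open(path, newline="", encoding="utf-8")` and pass its contents (or the file object, wrapped in `csv.DictReader` directly) instead of the sample string.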

Upvotes: 1

Jorge Luis

Reputation: 3253

You can take a look at PR #490, which closed issue NUTCH-1863. It lets you dump the CrawlDB in JSON format (check the -format flag).

One potential drawback is that this tool will probably not output the exact format that you want or need (the field names differ), but it should be a good starting point, and it should contain more data than you need.

Ultimately, you could implement a custom class to dump the content of a segment in your desired format. You could use the SegmentDump.java class as a base implementation.

Upvotes: 2
