kudarap

Reputation: 781

Parse microdata using the Apache Tika plugin on Apache Nutch

My objective is to crawl URLs, extract microdata, and save it to Solr.

I used this guide to set up Nutch, HBase, and Solr.

I'm using Nutch with HBase to crawl the URLs, and the Tika plugin for Nutch to parse the pages, but it only extracts metadata.

Did I miss something in the configuration? Please guide me or suggest alternatives.

Upvotes: 0

Views: 622

Answers (1)

Julien Nioche

Reputation: 4864

You need to implement your own ParseFilter and put the extraction logic there. The filter receives the DocumentFragment generated by the Tika parser, and you could use e.g. XPath on it to get the microdata.
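To illustrate the XPath part only, here is a minimal sketch that uses nothing but the JDK. It is not Nutch code: the class name `MicrodataXPathSketch`, the helper `toFragment`, and the sample markup are all made up for the example, and the fragment is built from a string rather than coming from Tika. In a real ParseFilter you would call something like `extractItemProps` on the DocumentFragment Nutch hands you.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MicrodataXPathSketch {

    /** Collects itemprop name -> value pairs from every element below root. */
    public static Map<String, String> extractItemProps(Node root) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The * wildcard avoids depending on element names, their case, or namespaces.
        NodeList nodes = (NodeList) xpath.evaluate(
                ".//*[@itemprop]", root, XPathConstants.NODESET);

        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element el = (Element) nodes.item(i);
            // Simplification: use the content attribute if present, else the text.
            String value = el.hasAttribute("content")
                    ? el.getAttribute("content")
                    : el.getTextContent().trim();
            props.put(el.getAttribute("itemprop"), value);
        }
        return props;
    }

    /** Hypothetical helper: stands in for the fragment a parser would produce. */
    public static DocumentFragment toFragment(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        DocumentFragment fragment = doc.createDocumentFragment();
        // Moves the root element out of the document and into the fragment.
        fragment.appendChild(doc.getDocumentElement());
        return fragment;
    }

    public static void main(String[] args) throws Exception {
        String sample =
                "<div itemscope=\"\" itemtype=\"http://schema.org/Person\">"
              + "<span itemprop=\"name\">Jane Doe</span>"
              + "<meta itemprop=\"jobTitle\" content=\"Professor\"/>"
              + "</div>";
        System.out.println(extractItemProps(toFragment(sample)));
    }
}
```

This ignores nested itemscope items, repeated properties, and the per-element value rules of the microdata spec (href, src, datetime and so on), so treat it as a starting point rather than a complete extractor.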

Note that the DOM generated by Tika is heavily normalised / modified, so your XPath expressions might not match. It may be better to rely on the old HTML parser instead.

One generic way of doing this would be to use Apache Any23, as done for instance in this storm-crawler module.

BTW, there is an open JIRA for a MicroDataHandler in Tika, which hasn't been committed yet.

HTH

Upvotes: 1
