Reputation: 108
I am trying to crawl data with Apache Nutch and index it with Apache Solr.
As part of this I want to parse the content as well. I am trying to figure out whether it is better to apply Tika in Nutch, in Solr, or in both.
Upvotes: 0
Views: 184
Reputation: 9789
Apply it as early as you can, but make sure to keep the original, full-fidelity document somewhere as well.
There is no point passing a binary file around if you know that in the end you are going to reduce it to a set of metadata fields and get rid of the rest.
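To make that concrete, here is a minimal sketch of the pattern, written in Python for brevity rather than as actual Nutch or Solr code. Everything in it is an assumption for illustration: `extract_fields` is a hypothetical stand-in for a real parser such as Tika, and `process_early` with its hash-named archive directory is just one possible way to keep the original around.

```python
import hashlib
import tempfile
from pathlib import Path


def extract_fields(raw: bytes) -> dict:
    """Hypothetical stand-in for a real parser such as Tika."""
    text = raw.decode("utf-8", errors="replace")
    lines = text.splitlines()
    title = lines[0].strip() if lines else ""
    return {"title": title, "content": text}


def process_early(raw: bytes, archive_dir: Path) -> dict:
    """Archive the original bytes, then return only the extracted fields."""
    digest = hashlib.sha256(raw).hexdigest()
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Keep the full-fidelity original, named by its content hash.
    (archive_dir / digest).write_bytes(raw)
    fields = extract_fields(raw)
    # The id lets the index entry point back to the archived original.
    fields["id"] = digest
    return fields


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        doc = process_early(b"Report title\nBody text of the report.", Path(tmp))
        print(doc["title"], doc["id"][:12])
```

Only the small `fields` dictionary travels on to the indexing step; the binary stays in the archive and can be re-parsed later if the extraction rules change.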
Upvotes: 2