Reputation: 5719
I have a question about analyzing documents. With Apache Tika, it is possible to extract the content and metadata of files of many different types.
Is it also possible to get the keywords of a file (e.g. via stemming) with Tika, or do I still need Lucene for that?
Upvotes: 5
Views: 2152
Reputation: 121
Tika and Lucene do different things.
Tika exists to grab data out of files. For example, you can use Tika to extract the text out of a PDF.
Lucene is an indexer. When you give Lucene Doc1.txt, Doc2.txt and Doc3.txt, it indexes them so that you can later search for a word or phrase like 'hello', and Lucene will respond with a list of the documents that contain it, along with how often it occurs in each document.
If you're going to index arbitrary content, you might use Tika to first extract the text, and then Lucene to index it.
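To make the "indexer" idea concrete, here is a toy sketch in plain Java of the kind of lookup described above. It is not Lucene's API or its actual data structures; `ToyIndex`, `add` and `search` are names made up for illustration, and the tokenizer is a naive split on non-alphanumeric characters with no stemming. In a real setup the text would come from Tika, and tokenizing and stemming would be handled by a Lucene analyzer.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy class: a minimal inverted index, not Lucene itself.
class ToyIndex {
    // term -> (document name -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> postings = new TreeMap<>();

    // Lowercase the text, split on anything that is not a letter or digit,
    // and count each token against the given document.
    public void add(String docName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token if the text starts with a delimiter
            }
            postings.computeIfAbsent(token, t -> new LinkedHashMap<>())
                    .merge(docName, 1, Integer::sum);
        }
    }

    // Return the documents containing the word and how often each contains it.
    public Map<String, Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // In practice these strings would be the text Tika extracted from each file.
        index.add("Doc1.txt", "Hello world, hello again");
        index.add("Doc2.txt", "Nothing to see here");
        index.add("Doc3.txt", "Say hello");
        System.out.println(index.search("hello")); // prints {Doc1.txt=2, Doc3.txt=1}
    }
}
```

Note that this toy matches exact lowercase tokens only, so "hello" would not find "hellos". Closing that gap is what stemming in the analysis step is for, and that step belongs to the indexing side rather than to text extraction.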
Upvotes: 2
Reputation: 763
I don't know whether it's possible, but I would recommend doing all the keyword analysis in Lucene. My personal reasons:
Upvotes: 4