Shivan Dragon

Reputation: 15219

Lucene: add facets to existing index

I'm a bit stumped about how to add facets to an already existing Lucene index.

I have a Lucene index, created with Lucene 3.1, that contains no facet information.

I've looked over the Lucene facet documentation, which shows how to create a faceted index from scratch: you create a new Lucene Document object, use the taxonomy tools to add facet information (categories) to it, then write that document to the Lucene index (using IndexWriter), which also adds the corresponding data to the taxonomy index (via TaxonomyWriter), as described here:

http://lucene.apache.org/core/3_6_2/api/all/org/apache/lucene/facet/doc-files/userguide.html#facet_accumulation

However, what I want is to use the data already stored in the existing Lucene index and from it create a new Lucene index (with a taxonomy index alongside it) that will contain the exact same data as the original index, plus the various category information.

My question is more precisely:

Is it enough to read a document from the original index, create its CategoryPath, and then write it to the new index, like this:

//get a document from the original Lucene index
//("*:*" is parsed as a MatchAllDocsQuery):
Query query = queryParser.parse("*:*");
TopDocs originalTopDocs = originalIndexSearcher.search(query, 100);
Document originalDocument = originalIndexSearcher.doc(originalTopDocs.scoreDocs[1].doc);

//create categories for the original document
//(categoriesPaths is a List<CategoryPath> built elsewhere):
CategoryDocumentBuilder categoryDocBuilder = new CategoryDocumentBuilder(taxonomyWriter);
categoryDocBuilder.setCategoryPaths(categoriesPaths);

//add the category fields to the original document:
Document originalDocumentWithCategories = categoryDocBuilder.build(originalDocument);

//write the document to the new index (note: IndexWriter has no write() method;
//addDocument() is the correct call):
newIndexWriter.addDocument(originalDocumentWithCategories);

Does the above code index the same document as the one stored in the original index, but with the added category data? For example, will the data from the non-stored fields of the original document still be present in the newly created and indexed document?

Also, is there a better way to do this update (maybe one that does not require creating a new index)?

Upvotes: 3

Views: 1299

Answers (1)

Shivan Dragon

Reputation: 15219

OK well, here are some insights on how I solved this:

  1. If you want to do it with Lucene only (as described in the question), you can only do that if:

    • All the fields you need have also been stored in the original index. If there are fields which have only been indexed (and not stored), then you can't recover them in order to re-index them in the new index (with facets).
    • You must also have knowledge of the Analyzers used to create the original index AND those used for creating queries:
      • the original index-time Analyzers are needed in order to get the same terms (from the stored values) when creating the new indexes
      • the Analyzers used on various QueryParsers when creating queries on the original index are needed to be able to re-construct the same queries for the new index

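The two conditions above can be sketched as a re-index loop. This is only an illustration, assuming every field you need was stored in the original index; the field name "type" is a hypothetical example, and the facet classes used are the Lucene 3.x ones (CategoryDocumentBuilder was later removed in the 4.x line):

```java
import java.util.Arrays;

import org.apache.lucene.document.Document;
import org.apache.lucene.facet.index.CategoryDocumentBuilder;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.TaxonomyWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class FacetReindexer {

    // Copies every live document from the old index into the new one,
    // attaching a facet category derived from a stored field.
    static void reindexWithFacets(IndexReader originalReader,
                                  IndexWriter newIndexWriter,
                                  TaxonomyWriter taxonomyWriter) throws Exception {
        for (int i = 0; i < originalReader.maxDoc(); i++) {
            if (originalReader.isDeleted(i)) {
                continue; // skip deleted slots (Lucene 3.x API)
            }
            // Only STORED fields survive this call; indexed-only fields are lost.
            Document doc = originalReader.document(i);
            String type = doc.get("type"); // null if "type" was not stored
            if (type != null) {
                CategoryDocumentBuilder builder = new CategoryDocumentBuilder(taxonomyWriter);
                builder.setCategoryPaths(Arrays.asList(new CategoryPath("type", type)));
                builder.build(doc); // adds the facet fields to doc in place
            }
            // The stored values get re-analyzed with newIndexWriter's Analyzer,
            // which is why it must match the original index-time Analyzer.
            newIndexWriter.addDocument(doc);
        }
    }
}
```

After the loop you'd commit/close both the taxonomy writer and the index writer, in that order.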
All this being said, I've noticed that, at least for the facet part, it's easier to implement using Solr, and, at least in my situation, performance does not degrade and is in fact sometimes better. The advantage with Solr is that it creates facets "auto-magically" (on all the fields that are pertinent for faceting). No extra facet indexing, no manual declaration of facet "paths", etc. And the Solr query API for facets is friendlier than the Lucene one as well.
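On the Solr side, faceting is just a matter of request parameters. A minimal sketch of building such a request URL (the host, core location and field name are placeholders, not from my actual setup):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SolrFacetQuery {

    // Builds a Solr /select URL that requests facet counts for one field.
    // solrBase and facetField are placeholders, e.g.
    // "http://localhost:8983/solr" and "type".
    static String solrFacetUrl(String solrBase, String facetField)
            throws UnsupportedEncodingException {
        return solrBase + "/select"
                + "?q=" + URLEncoder.encode("*:*", "UTF-8") // match all documents
                + "&facet=true"                             // turn faceting on
                + "&facet.field=" + URLEncoder.encode(facetField, "UTF-8")
                + "&facet.mincount=1";                      // hide zero-count values
    }
}
```

The response then carries the per-value counts for that field, with no taxonomy index or category paths to maintain.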

Problems you might get when migrating from Lucene to Solr are:

  • You still need all the info about the Lucene Analyzers used to index and query the initial Lucene index. Moving to Solr also adds the overhead of working out how those Lucene Analyzers map to what Solr has to offer (most Solr Analyzers/Filters are identical to Lucene's, but not all).
  • Solr has no equivalent of Lucene's programmatic query API (there's no way to do new SpanQuery("My blue boat*") and "auto-magically" have the correct query terms created behind the scenes). If you want to translate Lucene queries that make heavy use of that programmatic query API into Solr queries, you have to write your own tools that generate the corresponding Lucene query strings. You can of course still build the query objects using the Lucene API and then call toString() on them before sending them to Solr, but this doesn't work all the time and can get really complicated for certain complex queries.
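As a toy example of such a translation tool (entirely hypothetical, not part of Lucene or Solr): a proximity search of the kind you'd build programmatically with SpanNearQuery or PhraseQuery can be rendered as Lucene query-string syntax, which Solr's default parser does understand:

```java
public class QueryStringBuilder {

    // Renders a proximity search as Lucene query-string syntax:
    // field:"term1 term2 ..."~slop
    static String proximityQuery(String field, int slop, String... terms) {
        return field + ":\"" + String.join(" ", terms) + "\"~" + slop;
    }
}
```

For example, proximityQuery("body", 2, "my", "blue", "boat") yields body:"my blue boat"~2, which the standard query parser reads as a phrase query with slop 2; a real tool would also have to handle escaping and the query types that have no query-string equivalent.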

Upvotes: 1
