Reputation:
We plan to use Lucene as our full-text indexing (FTI) service. Among other things, we want to build a tag index based on a tag attribute of our documents that simply contains space-delimited tags.
Now, for suggesting tag completions, it would be great if there were a way to access all unique keywords of a given index. Lucene must be able to do that internally, since it expands "like"-style (wildcard) queries by rewriting them into an OR over the matching terms.
Any suggestions?
Upvotes: 2
Views: 2584
Reputation: 19402
You need to do two things:
1) When you create your document to index, make sure you use "ANALYZED":
doc.add(new Field("tags", tags, Field.Store.NO, Field.Index.ANALYZED));
2) Use a boolean query and OR all the terms:
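import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// OR together one TermQuery per tag; a document matches if it carries any of them.
BooleanQuery query = new BooleanQuery();
for (String tag : tags) {
    // TermQuery takes a Term (field name + value), not two strings.
    query.add(new TermQuery(new Term("tags", tag)), BooleanClause.Occur.SHOULD);
}
TopDocs docs = searcher.search(query, null, searchLimit);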
Upvotes: 1
Reputation: 5052
Be careful about using terms from the index directly. If you have stemming enabled while indexing, all sorts of odd-looking strings will start appearing in the term list: "Beauty" gets stemmed to "beauti", "create" is transformed to "creat", and so on.
Upvotes: 1
Reputation: 103
Tag completion needs to come from either (a) a prefix query on your list of tags (like pytho*), or (b) a query on an ngram-tokenized field (for example, Lucene will index python as p, py, pyt, pyth, pytho, python in a separate field). Both of these solutions allow you to do tag-completion queries on the fly.
What you're suggesting (and what Coady's response will get you) is a more offline approach, something that you don't really want to run at query time. This is also fine, since tag dictionaries are not expected to be updated in real time, but be aware that iterating through IndexReader's terms is not meant to be a "query-time" operation.
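To make option (b) concrete, here is a minimal plain-Java sketch of the leading-substring ("edge n-gram") expansion such a field would hold. It does not use Lucene's API; the class and method names (EdgeNGrams, edgeNGrams) and the minLen parameter are made up for illustration, and in a real index an analyzer would presumably produce these tokens for you.

```java
import java.util.ArrayList;
import java.util.List;

final class EdgeNGrams {
    // Returns every leading substring of the tag, from minLen characters up to the full tag.
    static List<String> edgeNGrams(String tag, int minLen) {
        List<String> grams = new ArrayList<>();
        for (int end = Math.max(minLen, 1); end <= tag.length(); end++) {
            grams.add(tag.substring(0, end));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [p, py, pyt, pyth, pytho, python]
        System.out.println(edgeNGrams("python", 1));
    }
}
```

With tokens like these in a separate field, whatever the user has typed so far can be looked up as an exact term, at the cost of a larger index.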
Upvotes: 1
Reputation: 57418
Use IndexReader.terms to get all the term values (and doc counts) for your tag field.
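For a feel of what that term list amounts to, here is a rough plain-Java sketch that builds the same kind of dictionary (unique tag to number of documents) straight from the space-delimited tag attributes. It is only an analogue and does not call Lucene; TagDictionary and docCounts are hypothetical names, and it assumes no stemming or other analysis is applied to the tags.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class TagDictionary {
    // Maps each unique tag to the number of documents that carry it, sorted by tag.
    static Map<String, Integer> docCounts(List<String> tagAttributes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String attribute : tagAttributes) {
            // De-duplicate within one document so a repeated tag counts once.
            Set<String> tagsInDoc = new HashSet<>();
            for (String tag : attribute.trim().split("\\s+")) {
                if (!tag.isEmpty()) {
                    tagsInDoc.add(tag);
                }
            }
            for (String tag : tagsInDoc) {
                counts.merge(tag, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A dictionary like this can be rebuilt offline and then used as the source for completions.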
Upvotes: 5
Reputation: 9855
If you are trying to do tag completion, you don't need all the unique tags; you need the tags that match what the user has already entered. This can be done with a wildcard, fuzzy, span, or prefix query, depending on the need.
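The prefix case can be sketched in plain Java over a sorted set of tags, which is roughly why prefix lookups are cheap on a sorted term dictionary. This is an illustration only, not Lucene code (against a real index, Lucene's PrefixQuery would be the thing to try); TagCompleter, add and complete are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class TagCompleter {
    private final NavigableSet<String> tags = new TreeSet<>();

    void add(String tag) {
        tags.add(tag);
    }

    // Returns up to limit tags that start with prefix, in sorted order.
    List<String> complete(String prefix, int limit) {
        List<String> matches = new ArrayList<>();
        // tailSet jumps to the first tag >= prefix; all matches are contiguous from there.
        for (String tag : tags.tailSet(prefix, true)) {
            if (!tag.startsWith(prefix) || matches.size() >= limit) {
                break;
            }
            matches.add(tag);
        }
        return matches;
    }
}
```

Only the tags matching what the user has typed are visited, so the full list of unique tags is never walked at query time.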
Upvotes: 0