Reputation: 1286
I have a product database (for soccer) which contains about 5k products. The Lucene index for product search currently contains Name, Category, Color and Numbers (ArtNo and EANs).
Relevant table example for the problem:
| Name                   | Color       |
|------------------------|-------------|
| Nike Training football | red black   |
| Nike Match football    | black white |
For the index I have created a custom Analyzer, so I can extend a StandardAnalyzer with additional behavior. The creation of the stream looks like this at the moment:
TokenStream result = new StandardTokenizer(Util.Version.LUCENE_29, reader );
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new StopFilter(true, result, stoptable );
return result;
The Analyzer is used both for the Indexer and the Searcher.
This is the current search logic:
BooleanQuery booleanQuery = new BooleanQuery(true);
string[] terms = query.Split(' ');
foreach (string s in terms)
{
    BooleanQuery subQuery = new BooleanQuery(true);

    var nameQuery = new FuzzyQuery(new Term("Name", s), 0.9f);
    nameQuery.SetBoost(6);
    subQuery.Add(nameQuery, BooleanClause.Occur.SHOULD);

    var colorQuery = new TermQuery(new Term("Color", s));
    subQuery.Add(colorQuery, BooleanClause.Occur.SHOULD);

    var categoryQuery = new FuzzyQuery(new Term("Category", s), 0.9f);
    categoryQuery.SetBoost(2);
    subQuery.Add(categoryQuery, BooleanClause.Occur.SHOULD);

    var numbersQuery = new TermQuery(new Term("Numbers", s));
    numbersQuery.SetBoost(10);
    subQuery.Add(numbersQuery, BooleanClause.Occur.SHOULD);

    booleanQuery.Add(subQuery, BooleanClause.Occur.MUST);
}
This already works to some extent.
A lot of products have names or categories containing words a user just won't search for. In the example I have used "Nike Match football". (Note: I have only translated it for use on SO, as most of the terms in the database are German.)
If I search for "Nike football red" I do get the result. But if I search for "Nike ball red" I don't get it, although this is how users will search for it. AFAIK Lucene can't search for substrings (except with wildcards), as it only compares whole tokens, and substring matching is what I need here.
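The token-comparison problem described above can be shown without Lucene at all. The following is only a plain-Java sketch: the class and method names are made up for illustration, and the tokenization (lowercase, split on whitespace) is a simplifying assumption rather than what StandardTokenizer does in detail.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: plain Java, not the Lucene API.
class TermMatchSketch {
    // Assumption: tokens are the lowercased, whitespace-separated words.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Comparable to a term lookup: the search term must equal a whole token.
    static boolean exactMatch(String text, String term) {
        return tokens(text).contains(term);
    }

    // Comparable to a prefix lookup: a token must start with the search term.
    static boolean prefixMatch(String text, String prefix) {
        return tokens(text).stream().anyMatch(t -> t.startsWith(prefix));
    }

    // What the user expects: the search term may appear anywhere in a token.
    static boolean infixMatch(String text, String part) {
        return tokens(text).stream().anyMatch(t -> t.contains(part));
    }

    public static void main(String[] args) {
        String name = "Nike Match football";
        System.out.println(exactMatch(name, "ball"));  // false
        System.out.println(prefixMatch(name, "foot")); // true
        System.out.println(prefixMatch(name, "ball")); // false
        System.out.println(infixMatch(name, "ball"));  // true
    }
}
```

Only the last kind of match finds "ball" inside "football", which is why neither an exact term nor a prefix is enough for this search.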
I have made Name and Category fuzzy and gave every column an appropriate boost according to its relevance.
I have already read about n-grams, but I really don't know how to use them correctly. The indexer works when I add the NGramTokenFilter to my custom analyzer. The problems are that I don't want it for every column (just Name and Category) and that the results are completely weird when I activate it.
If I add result = new NGramTokenFilter(result, 3, 4); to my analyzer and search for "nike ball", it just returns nothing.
Are n-grams the solution here? What am I doing wrong?
And do you have any other suggestions on how to improve a product search?
Upvotes: 1
Views: 963
Reputation: 763
I'm not familiar with n-grams, but as I see it there are two approaches in your case:
1. Work with wildcards in searches
Use prefix or fuzzy queries on the fields you want to search. It is important that you use TextField (see the Javadoc), because these fields are analyzed (StringField is not) and are meant for full-text search. Based on this it should be possible to search with multiple terms that don't match exactly.
2. Work with different analyzers for different fields
You can analyze different fields with different analyzers using the PerFieldAnalyzerWrapper (see the Javadoc). Define which field should be analyzed with which analyzer and you're good to go. But make sure you use the same analyzer for indexing and searching (it's a Lucene best practice).
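As a rough illustration of the per-field idea, here is a plain-Java sketch that does not use the Lucene API. It assumes the field names from the question (Name, Category, Color) and the gram sizes 3 and 4 from the NGramTokenFilter call there; the class, the `analyze` method and the simple whitespace tokenization are hypothetical stand-ins for real analyzers wired up through PerFieldAnalyzerWrapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: plain Java, not the Lucene API.
class PerFieldAnalysisSketch {
    // Assumption: a "word" is a lowercased, whitespace-separated token.
    static List<String> words(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // All character n-grams of each word with lengths minGram..maxGram.
    static List<String> nGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String word : words(text)) {
            for (int n = minGram; n <= maxGram; n++) {
                for (int i = 0; i + n <= word.length(); i++) {
                    grams.add(word.substring(i, i + n));
                }
            }
        }
        return grams;
    }

    // Chooses the analysis by field name: n-grams for Name and Category,
    // whole words for everything else (for example Color).
    static List<String> analyze(String field, String text) {
        if (field.equals("Name") || field.equals("Category")) {
            return nGrams(text, 3, 4);
        }
        return words(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("Name", "Nike Match football").contains("ball")); // true
        System.out.println(analyze("Color", "red black")); // [red, black]
    }
}
```

The same per-field analysis has to be applied to the search text as well. A query term that is put into a term query without going through the analyzer is compared against the indexed grams as a single whole token.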
Additional information
If you use wildcards and umlauts (German, yaaaay), you have to know that wildcard queries are not analyzed like normal queries. I faced the same problem and solved it with two kinds of field:
And when searching, I use a BooleanQuery over these two fields.
Upvotes: 1