Reputation: 1338
I have a lucene index, the documents are in around 20 different languages, and all are in the same index, I have a field 'lng' which I use to filter the results in only one language.
Based on this index I implemented spell-checker, the issue is that I get suggestions from all languages, which are irrelevant (if I am searching in English, suggestions in German are not what I need). My first idea was to create a different spell-check index for each language and than select index based on the language of the query, but I do not like this, is it possible to add additional column in spell-check index and use this, or is there some better way to do this?
Another question is how could I improve suggestions for 2 or more Terms in search query, currently I just do it for the first, which can be strongly improved to use them in combination, but I could not find any samples, or implementations which could help me solve this issue.
thanks almir
Upvotes: 3
Views: 2317
Reputation: 11849
If you look at the source of SpellChecker.SuggestSimilar
you can see:
BooleanQuery query = new BooleanQuery();
String[] grams;
String key;
for (int ng = GetMin(lengthWord); ng <= GetMax(lengthWord); ng++)
{
<...>
if (bStart > 0)
{
Add(query, "start" + ng, grams[0], bStart); // matches start of word
}
<...>
I.E. the suggestion search is just a bunch of OR'd boolean queries. You can certainly modify this code here with something like:
query.Add(new BooleanClause(new TermQuery(new Term("Language", "German")),
BooleanClause.Occur.MUST));
which will only look for suggestions in German. There is no way to do this without modifying your code though, apart from having multiple spellcheckers.
To deal with multiple terms, use QueryTermExtractor to get an array of your terms. Do spellcheck for each, and cartesian join. You may want to run a query on each combo and then sort based on the frequency they occur (like how the single-word spellchecker works).
Upvotes: 2
Reputation: 5354
As far as I know, it's not possible to add a 'language' field to the spellchecker index. I think that you need to define several search SpellChecker
s to achieve this.
EDIT: As it turned out in the comments that the language of the query is entered by the user as well, then my answer is limited to: define multiple spellcheckers. As for the second question that you added, I think that it was discussed before, for example here.
However, even if it would be possible, it doesn't solve the biggest problem, which is the detection of query language. It is highly non-trivial task for very short messages that can include acronyms, proper nouns and slang terms. Simple n-gram based methods can be inaccurate (as e.g. the language detector from Tika). So I think that the most challenging part is how to use certainty scores from both language detector and spellchecker and what threshold should be chosen to provide meaningful corrections (e.g. language detector prefers German, but spellchecker has a good match in Danish...).
Upvotes: 2
Reputation: 15159
After implement two different search features in two different sites with both lucene and sphinx, I can say that sphinx is the clear winner.
Consider using http://sphinxsearch.com/ instead of lucene. It's used by craigslist, among others.
They have a feature called morphology preprocessors:
# a list of morphology preprocessors to apply
# optional, default is empty
#
# builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
# 'soundex', and 'metaphone'; additional preprocessors available from
# libstemmer are 'libstemmer_XXX', where XXX is algorithm code
# (see libstemmer_c/libstemmer/modules.txt)
#
# morphology = stem_en, stem_ru, soundex
# morphology = libstemmer_german
# morphology = libstemmer_sv
morphology = none
There are many stemmers available, and as you can see, german is among them.
UPDATE:
Elaboration on why I feel that sphinx has been the clear winner for me.
Yes, these are just points of an opinion. However, they are an opinion from someone that has tried both systems.
Hope that helps...
Upvotes: -1