zebra

Reputation: 1338

Improving lucene spellcheck

I have a Lucene index whose documents are in around 20 different languages, all in the same index. I have a field 'lng' which I use to filter the results to a single language.

Based on this index I implemented a spell checker. The issue is that I get suggestions from all languages, which are irrelevant (if I am searching in English, suggestions in German are not what I need). My first idea was to create a separate spell-check index for each language and then select the index based on the language of the query, but I do not like this approach. Is it possible to add an additional field to the spell-check index and filter on it, or is there some better way to do this?

Another question is how I could improve suggestions for two or more terms in the search query. Currently I only generate suggestions for the first term; this could be greatly improved by considering the terms in combination, but I could not find any samples or implementations that would help me solve this issue.

Thanks, Almir

Upvotes: 3

Views: 2317

Answers (3)

Xodarap

Reputation: 11849

If you look at the source of SpellChecker.SuggestSimilar, you can see:

    BooleanQuery query = new BooleanQuery();
    String[] grams;
    String key;

    for (int ng = GetMin(lengthWord); ng <= GetMax(lengthWord); ng++)
    {
      <...>
      if (bStart > 0)
      { 
         Add(query, "start" + ng, grams[0], bStart); // matches start of word
      }
      <...>

I.e., the suggestion search is just a bunch of OR'd boolean queries. You can certainly modify this code with something like:

  query.Add(new BooleanClause(new TermQuery(new Term("Language", "German")),
                    BooleanClause.Occur.MUST));

which will only look for suggestions in German. There is no way to do this without modifying the SpellChecker code, though, apart from maintaining multiple spellcheckers.
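To illustrate the effect of that MUST clause outside of Lucene, here is a toy sketch in Python: candidates are restricted to one language before similarity ranking. The candidate pool, field values, and the similarity measure are all illustrative, not Lucene's internals.

```python
from difflib import SequenceMatcher

# toy "spellcheck index": (word, language) pairs; values are hypothetical
CANDIDATES = [
    ("house", "English"), ("horse", "English"),
    ("haus", "German"), ("hose", "German"),
]

def suggest_similar(word, language, max_suggestions=5):
    """Return candidates of the given language only, ranked by string
    similarity -- mimicking a MUST clause on a 'Language' field."""
    pool = [w for w, lng in CANDIDATES if lng == language]  # the MUST filter
    ranked = sorted(pool,
                    key=lambda w: SequenceMatcher(None, word, w).ratio(),
                    reverse=True)
    return ranked[:max_suggestions]

print(suggest_similar("hous", "German"))  # German candidates only
```

The point is only that the language constraint is applied before ranking, so an English word like "house" can never surface when the query language is German.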


To deal with multiple terms, use QueryTermExtractor to get an array of your terms. Do a spellcheck for each, and take the Cartesian join. You may want to run a query on each combination and then sort by how frequently the combinations occur (similar to how the single-word spellchecker works).

Upvotes: 2

Artur Nowak

Reputation: 5354

As far as I know, it is not possible to add a 'language' field to the spellchecker index. I think you would need to define a separate SpellChecker per language to achieve this.

EDIT: Since it turned out in the comments that the language of the query is entered by the user as well, my answer reduces to: define multiple spellcheckers. As for the second question you added, I think it has been discussed before, for example here.

However, even if it were possible, it would not solve the biggest problem, which is detecting the language of the query. This is a highly non-trivial task for very short texts that can include acronyms, proper nouns and slang terms. Simple n-gram based methods can be inaccurate (e.g. the language detector from Tika). So I think the most challenging part is how to combine the certainty scores from the language detector and the spellchecker, and what threshold to choose to provide meaningful corrections (e.g. the language detector prefers German, but the spellchecker has a good match in Danish...).
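To make the difficulty concrete, here is a toy character-trigram detector of the kind alluded to above. It is a sketch only: the profiles are built from two tiny sample sentences, whereas real detectors such as Tika's train on large corpora, and the scoring function is illustrative.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams, with padding so word boundaries count too."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# toy language profiles; real profiles come from large training corpora
PROFILES = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog"),
    "de": char_ngrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def detect(text):
    """Score each language by overlapping trigram counts and return
    (best_language, score), so the caller can apply a confidence threshold
    before trusting the guess on a short query."""
    grams = char_ngrams(text)
    scores = {lang: sum(min(count, prof[g]) for g, count in grams.items())
              for lang, prof in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

On a one-word query the scores for several languages are often close or zero, which is exactly why thresholding the detector's confidence against the spellchecker's match quality is the hard part.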

Upvotes: 2

Homer6

Reputation: 15159

After implementing search on two different sites with both Lucene and Sphinx, I can say that Sphinx is the clear winner.

Consider using http://sphinxsearch.com/ instead of lucene. It's used by craigslist, among others.

They have a feature called morphology preprocessors:

 # a list of morphology preprocessors to apply
 # optional, default is empty
 #
 # builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
 # 'soundex', and 'metaphone'; additional preprocessors available from
 # libstemmer are 'libstemmer_XXX', where XXX is algorithm code
 # (see libstemmer_c/libstemmer/modules.txt)
 #
 # morphology  = stem_en, stem_ru, soundex
 # morphology = libstemmer_german
 # morphology = libstemmer_sv
 morphology  = none

There are many stemmers available, and as you can see, german is among them.

UPDATE:

Elaboration on why I feel that sphinx has been the clear winner for me.

  • Speed: Sphinx is stupid fast, both at indexing and at serving search queries.
  • Relevance: Though this is hard to quantify, I felt I was able to get more relevant results with Sphinx than with my Lucene implementation.
  • Dependence on the filesystem: With Lucene, I was unable to break the dependence on the filesystem. While there are workarounds, like creating a RAM disk, I found it easier to just select the "run only in memory" option of Sphinx. This has implications for websites with more than one webserver, adding dynamic data to the index, reindexing, etc.

Yes, these are just points of opinion. However, they are the opinion of someone who has tried both systems.

Hope that helps...

Upvotes: -1
