TMBT

Reputation: 1183

Solr 5.1 spellchecker sometimes returns special characters in suggestions

Background

I have a Solr spellchecker configured like the following in schema.xml:

<fieldType name="spell_field" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords.txt" />
        <filter class="solr.LengthFilterFactory" min="3" max="255" />
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords.txt" />
        <filter class="solr.LengthFilterFactory" min="3" max="255" />
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" />
    </analyzer>
</fieldType>

which is used for:

<field name="spellcheck" type="spell_field" indexed="true" stored="false" multiValued="true" />

and like the following in solrconfig.xml:

<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">dflt</str>
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.maxCollations">3</str>
      <str name="spellcheck.collateMaxCollectDocs">1</str>
      <str name="spellcheck.maxCollationTries">2</str>
    </lst>
    <arr name="last-components">
        <str>suggest</str>
    </arr>
  </requestHandler>

  <searchComponent class="solr.SpellCheckComponent" name="suggest">
    <str name="queryAnalyzerFieldType">spellcheck</str>
    <lst name="spellchecker">
      <str name="name">suggest</str>    
      <str name="field">spellcheck</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <int name="minPrefix">1</int>
      <int name="minQueryLength">3</int>
      <int name="maxEdits">2</int>
      <int name="maxInspections">3</int>
      <int name="minQueryLength">3</int>
      <float name="maxQueryFrequency">0.01</float>
      <float name="thresholdTokenFrequency">.00001</float>
      <float name="accuracy">0.5</float>
    </lst>
  </searchComponent>

The problem

Solr will sometimes return a spelling suggestion containing special characters as the first suggestion. This is a problem because my application uses the first suggestion to rebuild the query.

For example, if I search on "VOLTAGER", the first spelling suggestion Solr produces is "voltage:", so the rebuilt query looks like myField:voltage:. Then, after the query is sent, Solr's logger displays the following warning: SpellCheckCollator: Exception trying to re-query to check if a spell check possibility would return any hits.

The underlying Exception is a parse error because myField:voltage: is not a valid query.

"VOLTAGER" also returns a plain "voltage", but further down the suggestion list, and my requirements state I must grab the first spelling correction from the list.

Ideally, in the above example, "VOLTAGER" would only return "voltage".

What I've Tried

I tried adding the following line to the index and query analyzer in the spell_field field type:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9])" replacement=""/>

This did remove all special characters from the spellchecker's suggestions, but it had the nasty side effect of sharply reducing the number of suggestions returned. For example, "VOLTAGER" no longer returns anything. Neither does "circut", which normally returns "circuit".

Currently, I have the following line in the Java application that connects to Solr:

correctedTerms = correctedTerms.replaceAll("[^A-Za-z0-9]", "");

It works by making sure whatever is returned has no special characters, but I would much rather configure Solr's spellchecker to stop returning corrections with special characters in the first place.
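A minimal sketch of that client-side cleanup applied to a whole suggestion list, assuming the suggestions have already been pulled out of the response as plain strings. The class and method names (SuggestionCleaner, clean, cleanAll) are hypothetical and not part of SolrJ. Because "voltage:" and "voltage" collapse to the same string after cleaning, the first entry of the cleaned list is still derived from Solr's first suggestion:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Solr or SolrJ.
class SuggestionCleaner {

    // Strip everything except ASCII letters and digits (same pattern as above).
    static String clean(String suggestion) {
        return suggestion.replaceAll("[^A-Za-z0-9]", "");
    }

    // Clean every suggestion, drop those that become empty, and collapse
    // duplicates while preserving the order Solr returned them in.
    static List<String> cleanAll(List<String> suggestions) {
        Set<String> seen = new LinkedHashSet<>();
        for (String s : suggestions) {
            String cleaned = clean(s);
            if (!cleaned.isEmpty()) {
                seen.add(cleaned);
            }
        }
        return new ArrayList<>(seen);
    }
}
```

This only hides the symptom on the client; it does not stop Solr from producing the suggestions in the first place.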

In summary

I'm trying to get Solr's spellchecker to stop returning special characters in its suggestions. Basically I just want letters returned. How do I achieve what I want?

Upvotes: 0

Views: 726

Answers (1)

TMBT

Reputation: 1183

In my original question, I was a bit confused about what was causing which errors and where. The ultimate problem was that Solr was automatically testing collations built from terms that had characters appended which are illegal in a query (usually the : character). The special characters weren't introduced by collation, however; they were returned by the spellchecker itself. Even after I removed all special characters from my analyzed fields, the spellchecker continued to return some suggestions with the : character appended.

The way I solved this problem was to just remove the collator itself. So now my spellcheck config looks like this:

<requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="df">dflt</str>
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.count">10</str>
    </lst>
    <arr name="last-components">
        <str>suggest</str>
    </arr>
  </requestHandler>

and I still have the following in my code when retrieving suggestions from the Suggestion Map:

correctedTerms = correctedTerms.replaceAll("[^A-Za-z0-9]", "");

Annoying, but at least now Solr isn't throwing a bunch of exceptions every time the collator fails and my code can provide a safety net to make sure nothing illegal makes it down to Solr.

The downside is I now have to do collations myself and, unlike Solr, I can't really guarantee any one collation will produce results. That said, my requirements aren't very heavy duty for the spellchecker, so while this behavior is undesirable, it's not unacceptable.
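A rough sketch of what such a do-it-yourself collation could look like, under several assumptions: the query is split on whitespace, the suggestions are available as a map from lowercased original token to an ordered list of suggestion strings, and the first suggestion that survives cleaning is used. ManualCollator and collate are hypothetical names, and nothing here checks whether the rebuilt query would actually return hits:

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Solr or SolrJ.
class ManualCollator {

    // Rebuild the query by replacing each token that has suggestions with
    // its first suggestion that is non-empty after cleaning.
    // Assumption: the map is keyed by the lowercased original token.
    static String collate(String query, Map<String, List<String>> suggestions) {
        StringJoiner rebuilt = new StringJoiner(" ");
        for (String token : query.trim().split("\\s+")) {
            String replacement = token; // keep the token if nothing usable is suggested
            List<String> candidates = suggestions.get(token.toLowerCase(Locale.ROOT));
            if (candidates != null) {
                for (String candidate : candidates) {
                    String cleaned = candidate.replaceAll("[^A-Za-z0-9]", "");
                    if (!cleaned.isEmpty()) {
                        replacement = cleaned;
                        break;
                    }
                }
            }
            rebuilt.add(replacement);
        }
        return rebuilt.toString();
    }
}
```

With a map containing "voltager" mapped to the list "voltage:", "voltage", the query "VOLTAGER circuit" would be rebuilt as "voltage circuit". Unlike Solr's collator, this never re-queries the index, so a rebuilt query can still come back empty.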

If anybody has had this problem and solved it without removing the collator, I would be very interested to hear about it.

Upvotes: 2
