entropy


Solr Stemming Result Documents do not show up even though counted in the resultset

I'm new to Solr. I'm trying to configure Solr 6.3 using Solarium, but I have run into a stemming issue. My collection of documents contains words like "call", "calls", "called", "calling" and "serv", "serve", "serves", "served", "serving". I included 'serv' to understand how the stemmer behaves with the stem it produces. When I query Solr from my Solarium PHP page, the number of results reported indicates that all documents containing any form of the searched word are counted. However, not all of those documents are displayed. For example:

For the query 'serv', it only shows the document with 'serv'.
For the query 'serve', it only shows the document with 'serve'.
For the query 'serves', it only shows the documents with 'serves' and 'serv'.
For the query 'served', it only shows the documents with 'served' and 'serv'.
For the query 'serving', it only shows the documents with 'serving' and 'serv'.

In the case of 'call':

call --> call
calls --> calls, call
called --> called, call
calling --> calling, call

So by the looks of it, the documents that contain the exact keyword or the actual stem show up with the term highlighted, but the rest of the documents do not show.

I would expect the stemmer to bring up all of these documents with the different occurrences of the keyword, i.e. a search for "call" should bring up "call", "calling", "called" and "calls".
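
To make that expectation concrete, here is a toy sketch of what a KeywordRepeat / stem / RemoveDuplicates chain is expected to emit. It does not call Solr. `toyStem`, `analyze` and `matches` are hypothetical helpers, and the suffix stripper is a crude stand-in for Hunspell that only covers the "call" examples.

```php
<?php
// Toy suffix stripper standing in for the Hunspell stemmer. This is NOT how
// Hunspell works; it only covers the "call" examples used here.
function toyStem(string $token): string
{
    foreach (['ing', 'ed', 's'] as $suffix) {
        $len = strlen($suffix);
        if (strlen($token) > $len + 2 && substr($token, -$len) === $suffix) {
            return substr($token, 0, -$len);
        }
    }
    return $token;
}

// Lowercase, split on whitespace, then emit each token plus its stem
// (KeywordRepeat) and drop the copy when both are identical (RemoveDuplicates).
function analyze(string $text): array
{
    $out = [];
    $tokens = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        foreach (array_unique([$token, toyStem($token)]) as $term) {
            $out[] = $term;
        }
    }
    return $out;
}

// A document matches when it shares at least one term with the query.
function matches(array $queryTerms, array $docTerms): bool
{
    return count(array_intersect($queryTerms, $docTerms)) > 0;
}

$docs = ['call', 'calls', 'called', 'calling'];
$queryTerms = analyze('calling'); // ['calling', 'call']

foreach ($docs as $doc) {
    $stemmed = matches($queryTerms, analyze($doc)) ? 'yes' : 'no';
    $raw = matches($queryTerms, [$doc]) ? 'yes' : 'no';
    echo "$doc: stemmed field=$stemmed, unstemmed field=$raw", PHP_EOL;
}
```

With these toy rules, a query for "calling" matches all four documents when they are analyzed the same way, but only "calling" and "call" when compared against unstemmed terms. That second case is the same pattern as in the list above.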

The relevant parts of my schema are as follows:

<field name="content" type="text_en" indexed="true" stored="true"/>
 <field name="_text_" type="stemmed_text" multiValued="true" indexed="true" stored="false"/>
 <dynamicField name="stemmed_*" type="stemmed_text" indexed="true" stored="false" />
 <copyField source="*" dest="_text_" />

<fieldType name="stemmed_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true" strictAffixParsing="true" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic" affix="en_GB.aff" ignoreCase="true" strictAffixParsing="true" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index"> 
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
  <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>   
  <filter class="solr.EnglishPossessiveFilterFactory"/>
</analyzer>
</fieldType>

Part of my PHP page is as follows: .....

// get a select query instance
$query = $client->createSelect();
$query->setFields(array('id', 'subject', 'content'));
// $query->setQuery('someWord');
$query->setQuery($someWord);
$query->setStart(0)->setRows($limit);
// get highlighting component and apply settings
$hl = $query->getHighlighting();
$hl->setSnippets(15);
$hl->setFields(array('content'));
$hl->setSimplePrefix('<strong>');
$hl->setSimplePostfix('</strong>');

.....

foreach ($resultset as $document) {
    $subj = '';
    if (is_array($document->subject)) {
        $subj = implode(', ', $document->subject);
    }
    echo '<table style="margin-bottom:20px; text-align:left; border:none; width:500px">';
    $highlightedDoc = $highlighting->getResult($document->id);
    if ($highlightedDoc) {
        foreach ($highlightedDoc as $field => $highlight) {
            echo $subj;
            echo implode(' (...) ', $highlight) . '<br/>';
        }
    }

    echo '</table>';
}
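
The loop above only prints something when a highlight snippet exists, so a document that is matched but has no snippet would be counted yet never shown. A minimal sketch for telling the two cases apart follows. It uses plain arrays in place of the Solarium result objects; `renderResults`, the array shapes and the sample data are all hypothetical.

```php
<?php
// Hypothetical stand-in for the Solarium loop: $docs is a list of
// ['id' => ..., 'content' => ...] rows and $highlights maps a document id to
// its highlight snippets for the 'content' field (possibly missing or empty).
function renderResults(array $docs, array $highlights): array
{
    $lines = [];
    foreach ($docs as $doc) {
        $snippets = $highlights[$doc['id']] ?? [];
        if ($snippets) {
            // Matched and highlighted: show the snippets.
            $lines[] = $doc['id'] . ': ' . implode(' (...) ', $snippets);
        } else {
            // Matched but no snippet: fall back to the stored content
            // instead of silently printing nothing.
            $lines[] = $doc['id'] . ' (no highlight): ' . $doc['content'];
        }
    }
    return $lines;
}

// Made-up data, only to exercise the function.
$docs = [
    ['id' => '1', 'content' => 'call'],
    ['id' => '2', 'content' => 'calling'],
    ['id' => '3', 'content' => 'called'],
];
$highlights = ['1' => ['<strong>call</strong>'], '2' => []];

foreach (renderResults($docs, $highlights) as $line) {
    echo $line, PHP_EOL;
}
```

If the missing documents appear with the "(no highlight)" marker in a check like this, they are being returned by the query but the highlighter produces nothing for `content`. That would be consistent with `content` using the unstemmed `text_en` type while the match happens in a stemmed field, though this is an assumption to verify rather than a confirmed diagnosis.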

I use the solrconfig that comes with the Solr installation. I would greatly appreciate it if someone could tell me what I am doing wrong. Am I missing something from my schema, or is there some setting I have to configure in the solrconfig? As a last resort I am thinking of using solr.EdgeNGramFilterFactory, but I would like to avoid this. I am attaching a link to an image of my Solr analysis screen.

Thank you in advance.

Solr Analysis for the word "calling"

Solr Admin Console Showing Highlighting
