Marco Gallo

Reputation: 49

Solr search by text

I'm having trouble finding a document in Solr with a query.
The document looks like this:

{
  "id": "890_03366_00739",
  "text": ["2509412 MARCO GLLMRC86E28L736X  03366 00739 "],
  "_version_": 1612212288969769000
}

If I search with the query text:GLLMRC86E28L736, the document is found correctly.
If I try the query text:GLLMRC86E28L736X, the document is not found. Why does this happen?
In my schema the field text is declared as <field name="text" type="text_general" indexed="true" required="true" stored="true"/>
I'm using Solr 7.0.0.

UPDATE:
The "Analysis" page shows this output for my field "text" and the query GLLMRC86E28L736X:
[Screenshot: Analysis output for query GLLMRC86E28L736X]

For the query GLLMRC86E28L736:
[Screenshot: Analysis output for query GLLMRC86E28L736]

Search by GLLMRC86E28L736X:
[Screenshot: search 1]

Search by GLLMRC86E28L736:
[Screenshot: search 2]

The field type "text_general" is declared as:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="2"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

Upvotes: 0

Views: 123

Answers (1)

MatsLindh

Reputation: 52822

Your EdgeNGramFilterFactory has maxGramSize="15", which cuts off the end of the token: GLLMRC86E28L736X is 16 characters long, so the trailing X is dropped when indexing, while it's kept when querying (as it should be, if you're attempting to match prefixes).

On the left side of the analysis screen you can see that it generates prefixes of GLLMRC86E28L736X, but it stops at 15 characters, before emitting the full token. The query is still the full GLLMRC86E28L736X, and since no indexed token matches it (the longest one is GLLMRC86E28L736), you get no hit.
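To make the cutoff concrete, here is a minimal Python sketch that imitates what an edge n-gram filter with minGramSize=2 and maxGramSize=15 emits for this token. It is not Solr code: the `edge_ngrams` helper is a hypothetical name, and the sketch assumes the other filters in the chain only lowercase this particular token and otherwise leave it unchanged.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the leading-edge n-grams of token, shortest first.

    Hypothetical helper that mimics the behaviour described above;
    it is not part of Solr or Lucene.
    """
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Index side: the lowercased token is expanded into prefixes of 2..15 characters.
indexed = set(edge_ngrams("gllmrc86e28l736x"))

# Query side: no n-gramming, so the whole lowercased term is looked up as-is.
print(len("gllmrc86e28l736x"))        # 16
print("gllmrc86e28l736" in indexed)   # True
print("gllmrc86e28l736x" in indexed)  # False
```

Calling `edge_ngrams("gllmrc86e28l736x", max_gram=16)` in the same sketch includes the full 16-character token, which is why raising the limit makes the second query match.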

Raise the maxGramSize for your field (it has to be at least 16 to cover this token), or search against a field that doesn't do any edge n-gramming if you only want exact matches.

Also, if I remember correctly, this is not the default definition of the text_general field type included in the examples, so in the future it'll be helpful to include the field type definition in the question from the start.

Upvotes: 1
