Cloud Falls

Reputation: 429

Apache Solr did not match Exact String

I have an issue and I'm really not sure what to do about it.

It is very simple. I have two documents indexed in Solr, with these titles:

"Scholastic Reader, Level 2 >" "Scholastic Reader, Level 3 >"

(The > symbol is at the end of each string.)

Search 1: When I search for "Scholastic Reader, Level", the service returns both documents, which is good.

XML Response:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
        <lst name="params">
            <str name="indent">on</str>
            <str name="start">0</str>
            <str name="q">type:masterseries AND title:("Scholastic Reader, Level")</str>
            <str name="version">2.2</str>
            <str name="rows">10</str>
        </lst>
    </lst>
    <result name="response" numFound="2" start="0">
        <doc>
            <str name="id">118</str>
            <arr name="title">
                <str>Scholastic Reader, Level 2 ></str>
            </arr>
            <str name="type">masterseries</str>
            <str name="uuid">3bf5b10c-a286-4ad0-9c63-bb402f57a7ed</str>
        </doc>
        <doc>
            <str name="id">118</str>
            <arr name="title">
                <str>Scholastic Reader, Level 3 ></str>
            </arr>
            <str name="type">masterseries</str>
            <str name="uuid">cdb19c28-0988-4375-acf0-916bc6664055</str>
        </doc>
    </result>
</response>

Search 2: Searching for "Scholastic Reader, Level 3" returns "Scholastic Reader, Level 3 >". Great!

Query String: type:masterseries AND title:("Scholastic Reader, Level 3")

XML Response:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
        <lst name="params">
            <str name="indent">on</str>
            <str name="start">0</str>
            <str name="q">type:masterseries AND title:("Scholastic Reader, Level 3")</str>
            <str name="version">2.2</str>
            <str name="rows">10</str>
        </lst>
    </lst>
    <result name="response" numFound="1" start="0">
        <doc>
            <str name="id">118</str>
            <arr name="title">
                <str>Scholastic Reader, Level 3 ></str>
            </arr>
            <str name="type">masterseries</str>
            <str name="uuid">cdb19c28-0988-4375-acf0-916bc6664055</str>
        </doc>
    </result>
</response>

But here is where it gets weird.

Search 3: Searching for "Scholastic Reader, Level 2", or even the exact string "Scholastic Reader, Level 2 >", returns nothing.

Query String: type:masterseries AND title:("Scholastic Reader, Level 2")

XML Response:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
        <lst name="params">
            <str name="indent">on</str>
            <str name="start">0</str>
            <str name="q">type:masterseries AND title:("Scholastic Reader, Level 2")</str>
            <str name="version">2.2</str>
            <str name="rows">10</str>
        </lst>
    </lst>
    <result name="response" numFound="0" start="0"/>
</response>

I even indexed documents with other level numbers, such as 1, 4, 5, and 6, and those work; only the string with level "2" does not.

Thanks for your help.

UPDATE:

Adding some configuration in the schema.xml file:

 <fieldType name="text_en" class="solr.TextField"
        positionIncrementGap="100">
        <analyzer type="index">
            <charFilter class="solr.HTMLStripCharFilterFactory" />
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.ISOLatin1AccentFilterFactory" />
            <filter class="solr.StopFilterFactory"
                ignoreCase="true" words="lang/stopwords_en.txt"
                enablePositionIncrements="false" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.EnglishPossessiveFilterFactory" />
            <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt" />
            <filter class="solr.PorterStemFilterFactory" />
        </analyzer>
        <analyzer type="query">
            <charFilter class="solr.HTMLStripCharFilterFactory" />            
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true" />
            <filter class="solr.StopFilterFactory"
                ignoreCase="true" words="lang/stopwords_en.txt"
                enablePositionIncrements="false" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.ISOLatin1AccentFilterFactory" />
            <filter class="solr.EnglishPossessiveFilterFactory" />
            <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt" />            
            <filter class="solr.PorterStemFilterFactory" />
        </analyzer>
    </fieldType>

Upvotes: 0

Views: 929

Answers (1)

femtoRgon

Reputation: 33351

I would bet your problem is in:

<filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true" />

Take a look at "synonyms.txt", and I would guess you will find an entry that maps "2" to "too" (if it were "to", it would then be removed by the StopFilter and you'd never notice a difference). Since expand="true", this would result in a query that looks like:

"Scholastic Reader Level 2 too"

That is fine for an unquoted set of TermQuery clauses, but not for a PhraseQuery. To fix this, you could incorporate the SynonymFilter into your "index" analyzer.

The other possibility I can see is that something odd is happening with ISOLatin1AccentFilterFactory coming after StopFilter and LowerCaseFilter in the query analyzer, since the order in which filters are applied may result in different outputs, but I very much doubt that is the problem.
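As a rough sketch of that fix, the "index" analyzer from the question could look something like the fragment below. It assumes the index side should use the same synonyms.txt and the same filter position as the query analyzer (directly after the tokenizer); neither is confirmed by the question, so treat both as assumptions.

```xml
<!-- Sketch only: the question's index analyzer with a synonym filter added.
     Assumes the same synonyms.txt that the query analyzer already uses. -->
<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <!-- Added (assumption): mirrors the query analyzer so both sides emit the same tokens -->
    <filter class="solr.SynonymFilterFactory"
        synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.ISOLatin1AccentFilterFactory" />
    <filter class="solr.StopFilterFactory"
        ignoreCase="true" words="lang/stopwords_en.txt"
        enablePositionIncrements="false" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EnglishPossessiveFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory"
        protected="protwords.txt" />
    <filter class="solr.PorterStemFilterFactory" />
</analyzer>
```

Existing documents would need to be reindexed before a change like this takes effect. Running the "Level 2" title and query through the Analysis page of the Solr admin UI should show whether the index and query token streams now line up.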

Upvotes: 2
