Iwko

Reputation: 73

Search phrase through SOLR multivalued field

I am implementing Solr search. When I type "abc def" I want to get all paragraphs that contain "abc def". For example, given these paragraphs:

{
    "paragraphs": ["abc def. bdbdbdbdbd, aa", "abd efe"]
},
{   
    "paragraphs": ["xyzabc def xyz", "fgh xx", "abcdef", "wwwabc defxxx"]
}

I want to get data from the first one. I want an exact match of the phrase, not a match inside a longer phrase: if I search for "god dog", the phrase "god doggo" should not be included in the results.

The problem is that when I run the query paragraphs:"abc def" I get empty results.

This is part of my schema.xml:

  <field name="paragraphs" type="text" indexed="true" stored="true" required="true" multiValued="true"/>
  <types>
    <fieldType name="text" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
</types>

I tried to use StandardTokenizerFactory instead of KeywordTokenizerFactory, but the result was the same. I can get data using (*abc*), but this also returns elements like xabcz, which I am not interested in.

Upvotes: 0

Views: 301

Answers (1)

MatsLindh

Reputation: 52792

You'll have to drop the KeywordTokenizer - it keeps the whole stored text as a single token.

Using the WhitespaceTokenizer or the StandardTokenizer should work. Remember that you have to reindex after changing the analysis chain in any way (unless you're only changing how content is processed for querying).
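A minimal field type along those lines (a sketch only - it reuses the type name from the question's schema and assumes the standard Solr factories) could look like:

```xml
<fieldType name="text" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <!-- one analyzer for both index and query time, so tokens line up -->
    <analyzer>
        <!-- splits each value on word boundaries instead of keeping it as one token -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

With this chain, "abc def. bdbdbdbdbd, aa" is indexed as the tokens abc, def, bdbdbdbdbd, aa, so the phrase query paragraphs:"abc def" matches the adjacent tokens abc and def, while abcdef and wwwabc defxxx produce different tokens and don't match.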

Using the default dynamic field *_txt (defined as a StandardTokenizer with only lowercasing and stopword removal), and with your two documents indexed:

q=*:*

"response":{"numFound":2,"start":0,"docs":[
    {
        "paragraphs_txt":["abc def. bdbdbdbdbd, aa",
          "abd efe"],
        "id":"d696c435-2267-442d-9abe-ea754793d5cf",
        "_version_":1602547400543567872},
    {
        "paragraphs_txt":["xyzabc def xyz",
          "fgh xx",
          "abcdef",
          "wwwabc defxxx"],
        "id":"09bbba7c-b407-403c-9771-582ef23f6b56",
        "_version_":1602547400598093824}]
}}

q=paragraphs_txt:"abc def"

"response":{"numFound":1,"start":0,"docs":[
    {
        "paragraphs_txt":["abc def. bdbdbdbdbd, aa",
          "abd efe"],
        "id":"d696c435-2267-442d-9abe-ea754793d5cf",
        "_version_":1602547400543567872}]
}}

Upvotes: 1
