user595014

Retrieving concatenated field values in Solr

Environment: Solr 8.9.0, Java version "11.0.12" 2021-07-20 LTS

The following .csv file is indexed in Solr:

id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
0553573403,book,Game Thrones Clash,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553573404,book,Game ,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553573405,book,Game Kings Storm,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553573406,book,Games Clash,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553573407,book,Game Thronesa,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553573408,book,Game Thrones Clash,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553573409,book,GameThrones Clash,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553573410,book,Game ThronesClash,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy

The field type text_general is configured for the field 'name'.

The query runs against the field 'name' with the input value 'Game Thrones Clash'. I need both of the following:

  1. If at least 75% of the tokens are fuzzy matches, the document should be returned. The expected ids are '0553573403', '0553573406', '0553573407', '0553573408'.
  2. If tokens of the 'name' value are concatenated, the document should also be returned. The expected ids are '0553573409' (GameThrones Clash), '0553573410' (Game ThronesClash).
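
To make criterion 1 concrete, here is a minimal Python sketch that applies it to the sample names outside of Solr. It is only an illustration under stated assumptions: a token counts as a fuzzy match when it is within 2 edits (the default for Solr's `~` operator), plain Levenshtein distance stands in for whatever distance Solr uses internally, and the 75% threshold is rounded down, as the Solr documentation describes for `mm` percentages. The helper names are hypothetical.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def fuzzy_hits(query: str, name: str, max_edits: int = 2) -> int:
    """Count query tokens that have some name token within max_edits."""
    name_tokens = name.lower().split()
    return sum(
        any(edit_distance(q, t) <= max_edits for t in name_tokens)
        for q in query.lower().split()
    )

def min_should_match(num_clauses: int, percent: int) -> int:
    """Assumed rounding: the percentage of clauses is rounded down."""
    return math.floor(num_clauses * percent / 100)

# 'name' values copied from the indexed .csv file.
NAMES = {
    "0553573403": "Game Thrones Clash",
    "0553573404": "Game ",
    "0553573405": "Game Kings Storm",
    "0553573406": "Games Clash",
    "0553573407": "Game Thronesa",
    "0553573408": "Game Thrones Clash",
    "0553573409": "GameThrones Clash",
    "0553573410": "Game ThronesClash",
}

query = "Game Thrones Clash"
required = min_should_match(len(query.split()), 75)  # 75% of 3 clauses
matched = sorted(
    doc_id for doc_id, name in NAMES.items()
    if fuzzy_hits(query, name) >= required
)
print(required, matched)
```

Under these assumptions the threshold works out to two of the three tokens, which is why 'Games Clash' qualifies, while the two concatenated names reach only one fuzzy hit each and are left out.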

I understand that Extended DisMax supports the 'mm' (minimum should match) query parameter together with fuzzy search. The following query satisfies criterion 1:

curl -G http://$solrIp:8983/solr/testCore2/select --data-urlencode "q=(name:'Game~' OR name:'Thrones~' OR name:'Clash~')" --data-urlencode "defType=edismax" --data-urlencode "mm=75%" --data-urlencode "sort=id asc"
{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "mm":"75%",
      "q":"(name:'Game~' OR name:'Thrones~' OR name:'Clash~')",
      "defType":"edismax",
      "sort":"id asc"}},
  "response":{"numFound":4,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"0553573403",
        "cat":["book"],
        "name":["Game Thrones Clash"],
        "price":[7.99],
        "inStock":[true],
        "author":["George R.R. Martin"],
        "series_t":"A Song of Ice and Fire",
        "sequence_i":1,
        "genre_s":"fantasy",
        "_version_":1738145066191421440},
      {
        "id":"0553573406",
        "cat":["book"],
        "name":["Games Clash"],
        "price":[7.99],
        "inStock":[true],
        "author":["George R.R. Martin"],
        "series_t":"A Song of Ice and Fire",
        "sequence_i":1,
        "genre_s":"fantasy",
        "_version_":1738145066196664320},
      {
        "id":"0553573407",
        "cat":["book"],
        "name":["Game Thronesa"],
        "price":[7.99],
        "inStock":[true],
        "author":["George R.R. Martin"],
        "series_t":"A Song of Ice and Fire",
        "sequence_i":1,
        "genre_s":"fantasy",
        "_version_":1738145066196664321},
      {
        "id":"0553573408",
        "cat":["book"],
        "name":["Game Thrones Clash"],
        "price":[7.99],
        "inStock":[true],
        "author":["George R.R. Martin"],
        "series_t":"A Song of Ice and Fire",
        "sequence_i":1,
        "genre_s":"fantasy",
        "_version_":1738145066197712896}]
  }}

However, the following documents are not part of the output:

0553573409,book,GameThrones Clash,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553573410,book,Game ThronesClash,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy

What should I change in the query so that it also returns the documents above?

For the field 'name', multiValued will be false.

Note: a change of case within a word ("CamelCase") is not guaranteed. The values may also be 'Gamethrones Clash' or 'Game Thronesclash'.

I also indexed the data with 'ShingleFilterFactory' added to the fieldType text_general:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

However, I cannot work out what to change in the query so that the input value 'Game Thrones Clash' retrieves all of the following results, with at least two tokens fuzzy-matched:

  1. '0553573403', '0553573406', '0553573407', '0553573408'
  2. '0553573409'(GameThrones Clash), '0553573410'(Game ThronesClash).
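
As a rough illustration of criterion 2, the Python sketch below joins adjacent query tokens without a separator and compares the result with the lowercased tokens of the two concatenated names. This is a client-side simulation only, not Solr's analysis chain. It mirrors what a shingle filter with an empty token separator would be expected to emit for the query text; whether configuring ShingleFilterFactory that way on the query analyzer gives the desired result in Solr 8.9.0 is an assumption that has not been verified here. The function name is hypothetical.

```python
def concatenated_shingles(text: str, size: int = 2) -> set:
    """Join each run of `size` adjacent lowercased tokens with no separator."""
    tokens = text.lower().split()
    return {"".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}

query = "Game Thrones Clash"
# Single tokens plus their pairwise concatenations.
query_terms = set(query.lower().split()) | concatenated_shingles(query)

# The two documents that the fuzzy query did not return.
CONCATENATED = {
    "0553573409": "GameThrones Clash",
    "0553573410": "Game ThronesClash",
}

overlaps = {
    doc_id: query_terms & set(name.lower().split())
    for doc_id, name in CONCATENATED.items()
}
for doc_id, terms in overlaps.items():
    print(doc_id, sorted(terms))
```

In this simulation both names share two terms with the expanded query, so a minimum of two matching terms would be met for each of them. Because the comparison is done on lowercased tokens, the variants 'Gamethrones Clash' and 'Game Thronesclash' behave the same way.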
