atoms

Reputation: 3093

Elasticsearch concatenate two words into one

I have a field ManufacturerName

"ManufacturerName": {
    "type": "keyword",
    "normalizer" : "keyword_lowercase"
},

And a normalizer

"normalizer": {
    "keyword_lowercase": {
       "type": "custom",
       "filter": ["lowercase"]
    }
}

When searching for 'ripcurl' it matches. However when searching for 'rip curl' it doesn't.

What should I use to concatenate certain words, e.g. 'rip curl' -> 'ripcurl'?

Apologies if this is a duplicate, I've spent some time seeking a solution to this.

Upvotes: 1

Views: 1423

Answers (1)

Kamal Kunjapur

Reputation: 8860

For this you would want a text field rather than a keyword field, analyzed with an Ngram Tokenizer, so that the search text and the indexed value are compared on overlapping character fragments instead of as whole strings.

Below is a sample mapping, query and response:

Mapping:

PUT mysomeindex
{
  "mappings": {
    "mydocs":{
      "properties": { 
        "ManufacturerName":{
          "type": "text",
          "analyzer": "my_analyzer", 
          "fields":{
            "keyword":{
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  }, 
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer":{
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": [ "synonyms" ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "filter": {
        "synonyms":{
          "type": "synonym",
          "synonyms" : ["henry loyd, henry loid, henry lloyd => henri lloyd"]
        }
      }
    }
  }
}

Notice that ManufacturerName is a multi-field: the field itself is of type text, and it has a keyword sub-field (ManufacturerName.keyword). For exact matches and aggregations you can use the keyword sub-field, while for this requirement you can query the text field.
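
As a rough sketch of the exact-match side, assuming the mapping above, the keyword sub-field could be queried and aggregated as follows. The aggregation name manufacturers is made up for illustration, and the term value is written already lowercased so the example does not depend on whether your version applies the normalizer to term queries.

POST mysomeindex/_search
{
  "query": {
    "term": {
      "ManufacturerName.keyword": "ripcurl"
    }
  }
}

POST mysomeindex/_search
{
  "size": 0,
  "aggs": {
    "manufacturers": {
      "terms": {
        "field": "ManufacturerName.keyword"
      }
    }
  }
}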

Sample Documents:

POST mysomeindex/mydocs/1
{
  "ManufacturerName": "ripcurl"
}

POST mysomeindex/mydocs/2
{
  "ManufacturerName": "henri lloyd"
}

When you ingest the above documents, Elasticsearch splits each value into tokens of 3 to 5 characters and stores them in the inverted index, e.g. rip, ipc, pcu, etc.

You can execute the request below to see which tokens get created:

POST mysomeindex/_analyze
{
  "text": "ripcurl",
  "analyzer": "my_analyzer"
}
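
To see why a two-word search can still match, it may help to run the same analysis on the search text. A sketch, assuming the index and analyzer defined above:

POST mysomeindex/_analyze
{
  "text": "rip curl",
  "analyzer": "my_analyzer"
}

With min_gram 3 and max_gram 5 this should return rip, cur, curl and url, all of which are also among the grams produced for ripcurl. A match query combines its terms with OR by default, so a single shared gram is enough for a hit. The flip side is that unrelated names sharing one gram will also match, only with a lower score.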

I'd also suggest looking into the Edge Ngram tokenizer to see whether it fits your requirement better.
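
A minimal sketch of what that alternative might look like, assuming only the tokenizer definition changes; my_edge_tokenizer is a hypothetical name, and the analyzer's "tokenizer" entry would have to point at it:

"tokenizer": {
  "my_edge_tokenizer": {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 5,
    "token_chars": [
      "letter",
      "digit"
    ]
  }
}

Edge n-grams are anchored at the start of each word, so ripcurl would be indexed only as rip, ripc and ripcu. A search for rip curl would then match on rip alone, and a search for curl by itself would not match ripcurl at all. Whether that trade-off (fewer tokens, prefix-only matching) is acceptable depends on how your users search.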

Query:

POST mysomeindex/_search
{
  "query": {
    "match": {
      "ManufacturerName": "rip curl"
    }
  }
}

Response:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.25316024,
    "hits": [
      {
        "_index": "mysomeindex",
        "_type": "mydocs",
        "_id": "1",
        "_score": 0.25316024,
        "_source": {
          "ManufacturerName": "ripcurl"
        }
      }
    ]
  }
}

Query for Synonyms:

POST mysomeindex/_search
{
  "query": {
    "match": {
      "ManufacturerName": "henri lloyd"
    }
  }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 2.2784421,
    "hits": [
      {
        "_index": "mysomeindex",
        "_type": "mydocs",
        "_id": "2",
        "_score": 2.2784421,
        "_source": {
          "ManufacturerName": "henri lloyd"
        }
      }
    ]
  }
}

Note: If you intend to use synonyms, the best way is to keep them in a text file and reference that file by a path relative to the config folder location, using the synonym filter's synonyms_path setting as described in the synonym token filter documentation.
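
As a sketch of the file-based variant, the filter definition could look like the following. The file name analysis/synonyms.txt is an assumption for illustration; the path is resolved relative to the config folder.

"filter": {
  "synonyms": {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt"
  }
}

The file would then hold one rule per line, in the same format as the inline list, for example:

henry loyd, henry loid, henry lloyd => henri lloyd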

Hope this helps!

Upvotes: 1
