Reputation: 3093
I have a field ManufacturerName
"ManufacturerName": {
"type": "keyword",
"normalizer" : "keyword_lowercase"
},
And a normalizer
"normalizer": {
"keyword_lowercase": {
"type": "custom",
"filter": ["lowercase"]
}
}
When searching for 'ripcurl' it matches. However when searching for 'rip curl' it doesn't.
How/what would use to concatenate certain words. i.e. 'rip curl' -> 'ripcurl'
Apologies if this is a duplicate, I've spent some time seeking a solution to this.
Upvotes: 1
Views: 1423
Reputation: 8860
You would want to make use of text
field for what you are looking for and get this kind of requirement carried out via Ngram Tokenizer
Below is a sample mapping, query and response:
PUT mysomeindex
{
"mappings": {
"mydocs":{
"properties": {
"ManufacturerName":{
"type": "text",
"analyzer": "my_analyzer",
"fields":{
"keyword":{
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
},
"settings": {
"analysis": {
"normalizer": {
"my_normalizer":{
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter": [ "synonyms" ]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
},
"filter": {
"synonyms":{
"type": "synonym",
"synonyms" : ["henry loyd, henry loid, henry lloyd => henri lloyd"]
}
}
}
}
}
Notice that the field ManufacturerName
is a multi-field which has both text
type and its sibling keyword
type. That way for exact matches & for aggregation queries you could make use of keyword
field while for this requirement you can make use of text
field.
POST mysomeindex/mydocs/1
{
"ManufacturerName": "ripcurl"
}
POST mysomeindex/mydocs/2
{
"ManufacturerName": "henri lloyd"
}
What elasticsearch does when you ingest the above document is, it creates tokens of size from 3
to 5
length and stored them in inverted index for e.g. `rip, ipc, pcu etc...
You can execute the below query to see what tokens gets created:
POST mysomeindex/_analyze
{
"text": "ripcurl",
"analyzer": "my_analyzer"
}
Also I'd suggest you to look into Edge Ngram tokenizer and see if that fits better for your requirement.
POST mysomeindex/_search
{
"query": {
"match": {
"ManufacturerName": "rip curl"
}
}
}
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.25316024,
"hits": [
{
"_index": "mysomeindex",
"_type": "mydocs",
"_id": "1",
"_score": 0.25316024,
"_source": {
"ManufacturerName": "ripcurl"
}
}
]
}
}
POST mysomeindex/_search
{
"query": {
"match": {
"ManufacturerName": "henri lloyd"
}
}
}
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 2.2784421,
"hits": [
{
"_index": "mysomeindex",
"_type": "mydocs",
"_id": "2",
"_score": 2.2784421,
"_source": {
"ManufacturerName": "henry lloyd"
}
}
]
}
}
Note: If you intend to make use of synonyms then the best way it to have them in the a text file and add that relative to the config
folder location as mentioned here
Hope this helps!
Upvotes: 1