Reputation: 1069
How do I create a custom tokenizer without using the default built-in token filters? For example, given the text "Samsung Galaxy S9", I want it indexed as
["samsung", "galaxy", "s9", "samsung galaxy s9", "samsung s9", "samsung galaxy", "galaxy s9"]
How would I do that?
Upvotes: 2
Views: 259
Reputation: 17735
PUT testindex
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 20,
          "min_shingle_size": 2,
          "output_unigrams": "true"
        }
      },
      "analyzer": {
        "analyzer_shingle": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "analyzer": "analyzer_shingle",
          "search_analyzer": "standard",
          "type": "text"
        }
      }
    }
  }
}
POST testindex/product/1
{
  "title": "Samsung Galaxy S9"
}
GET testindex/_analyze
{
  "analyzer": "analyzer_shingle",
  "text": ["Samsung Galaxy S9"]
}
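To see what the shingle filter produces, here is a rough Python sketch of its behavior (this is an illustration of the concept, not the actual Elasticsearch implementation): lowercase whitespace tokens are combined into contiguous runs of 2 to max_shingle_size words, with the unigrams kept because output_unigrams is true. Note that shingles are contiguous, so the non-adjacent pair "samsung s9" from the question is not generated by this filter.

```python
def shingles(text, min_size=2, max_size=20, output_unigrams=True):
    # Lowercasing + whitespace split stands in for the whitespace
    # tokenizer plus the lowercase filter in the analyzer above.
    tokens = text.lower().split()
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        # Emit contiguous word sequences starting at position i.
        for n in range(min_size, max_size + 1):
            if i + n > len(tokens):
                break
            out.append(" ".join(tokens[i:i + n]))
    return out

print(shingles("Samsung Galaxy S9"))
# ['samsung', 'samsung galaxy', 'samsung galaxy s9', 'galaxy', 'galaxy s9', 's9']
```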
You can find more about shingles here and here.
The first link is great and covers a lot. If you want to use the standard tokenizer instead of the whitespace tokenizer, you will have to handle stop words, as the blog post describes. Both URLs are official Elasticsearch sources.
Upvotes: 2