Reputation: 1078
I am using the default (standard) tokenizer for my index in Elasticsearch and adding documents to it, but the standard tokenizer can't split words that contain a "." (dot). For example:
POST _analyze
{
  "tokenizer": "standard",
  "text": "pink.jpg"
}
Gives me the response as:
{
  "tokens": [
    {
      "token": "pink.jpg",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
The response above shows the whole word as a single term. Can it be split into two terms on the "." (dot) with the standard tokenizer? Is there any setting in the standard tokenizer for this?
Upvotes: 1
Views: 861
Reputation: 374
You can't accomplish what you want with the Standard Tokenizer, but the Letter Tokenizer can help you here:
POST _analyze
{
  "tokenizer": "letter",
  "text": "pink.jpg"
}
which produces
{
  "tokens": [
    {
      "token": "pink",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "jpg",
      "start_offset": 5,
      "end_offset": 8,
      "type": "word",
      "position": 1
    }
  ]
}
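If that output looks right for you, you can wire the letter tokenizer into your index through a custom analyzer, roughly like this (a sketch using Elasticsearch 7.x mapping syntax; the index, analyzer, and field names are just examples, and the lowercase filter is optional):
PUT filenames_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "filename_analyzer": {
          "type": "custom",
          "tokenizer": "letter",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "filename": {
        "type": "text",
        "analyzer": "filename_analyzer"
      }
    }
  }
}
Documents indexed into the filename field will then be searchable by "pink" or "jpg" separately.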
Upvotes: 1
Reputation: 28703
The Letter Tokenizer will do what you want, though I'm not sure it will cover all your use cases.
The Standard Tokenizer has only one configuration parameter, `max_token_length`, which won't be helpful in your case.
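For completeness, `max_token_length` is configured roughly like this (a sketch; my_index, my_analyzer, and my_tokenizer are placeholder names). It only caps how long a single token may grow before it is split at that length; it does not add "." to the set of split characters:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "pink.jpg"
}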
Upvotes: 0