Suraj Dalvi

Reputation: 1078

Configuring the standard tokenizer elasticsearch

I am using the default (standard) tokenizer for my index in Elasticsearch and adding documents to it, but the standard tokenizer can't split words that contain a "." (dot). For example:

POST _analyze
{
  "tokenizer": "standard",
  "text": "pink.jpg"
}

This gives me the response:

{
  "tokens": [
    {
      "token": "pink.jpg",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

The above response shows the whole word as a single term. Can I split it into two terms on the "." (dot) with the standard tokenizer? Is there any setting in the standard tokenizer for this?

Upvotes: 1

Views: 861

Answers (2)

RCP

Reputation: 374

You can't accomplish this with the standard tokenizer, but the letter tokenizer can help you:

POST _analyze
{
  "tokenizer": "letter",
  "text": "pink.jpg"
}

which produces

{
  "tokens": [
    {
      "token": "pink",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "jpg",
      "start_offset": 5,
      "end_offset": 8,
      "type": "word",
      "position": 1
    }
  ]
}
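
If that output is what you need, here is a minimal sketch of how the letter tokenizer could be wired into an index as a custom analyzer. This assumes Elasticsearch 7+ mapping syntax; my_index, filename_analyzer, and filename are placeholder names, not anything from the question:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "filename_analyzer": {
          "type": "custom",
          "tokenizer": "letter",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "filename": {
        "type": "text",
        "analyzer": "filename_analyzer"
      }
    }
  }
}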

Upvotes: 1

khachik

Reputation: 28703

The letter tokenizer will do what you want, though I'm not sure it will cover all your use cases.

The standard tokenizer has only one configuration parameter, `max_token_length`, which won't be helpful in your case.
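
For instance, the letter tokenizer also splits on digits, so "pink2.jpg" would be tokenized as "pink" and "jpg". One alternative sketch, assuming your Elasticsearch version has the char_group tokenizer (6.4+), is to split only on whitespace and dots; the sample text here is just illustrative:

POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [ "whitespace", "." ]
  },
  "text": "pink2.jpg"
}

This would produce the tokens "pink2" and "jpg", keeping digits intact while still breaking on the dot.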

Upvotes: 0
