user3181257

Reputation: 85

How to check the tokens generated for different tokenizers in Elasticsearch

I have been using different types of tokenizers for test and demonstration purposes. I need to check how a particular text field is tokenized using different tokenizers and also see the tokens generated.

How can I achieve that?

Upvotes: 6

Views: 748

Answers (2)

Vineeth Mohan

Reputation: 19253

Apart from what @Val has mentioned, you can also try out the term vectors API if you intend to study how tokenizers work. You can run something like the following just to examine the tokenization happening in a field:

GET /index-name/type-name/doc-id/_termvector?fields=field-to-be-examined

To learn more about tokenizers and their operation, you can refer to this blog.
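For concreteness, the request above could look like this against a hypothetical index; the index, type, doc id, and field names below are placeholders, so substitute your own. The snippet just assembles and prints the curl command to run:

```shell
# Placeholders -- substitute your own index/type/doc-id/field names.
INDEX=articles
TYPE=post
DOC_ID=1
FIELD=body

# _termvector returns the tokens, frequencies, and positions stored for the field.
CMD="curl -XGET 'localhost:9200/$INDEX/$TYPE/$DOC_ID/_termvector?fields=$FIELD&pretty'"
echo "$CMD"
```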

Upvotes: 3

Val

Reputation: 217254

You can use the _analyze endpoint for this purpose.

For instance, using the standard analyzer, you can analyze the text 'this is a test' like this:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'this is a test'

And this produces the following tokens:

{
  "tokens" : [ {
    "token" : "this",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "a",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
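Since the question asks about a particular text field, note that the _analyze endpoint also accepts a field parameter, which analyzes the text with whatever analyzer is mapped to that field in the given index. A sketch with hypothetical index and field names (adjust to your mapping); the snippet just prints the curl command to run:

```shell
# Placeholders -- substitute your own index and mapped field name.
INDEX=my-index
FIELD=title

# The field parameter makes _analyze use the analyzer mapped to that field.
CMD="curl -XGET 'localhost:9200/$INDEX/_analyze?field=$FIELD&pretty' -d 'this is a test'"
echo "$CMD"
```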

Of course, you can use any of the existing analyzers, and you can also specify tokenizers using the tokenizer parameter, token filters using the token_filters parameter, and character filters using the char_filters parameter. For instance, analyzing the HTML input 'THIS is a <b>TEST</b>' with the keyword tokenizer, the lowercase token filter, and the html_strip character filter yields a single lowercase token with the HTML markup stripped:

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip' -d 'THIS is a <b>TEST</b>'

{
  "tokens" : [ {
    "token" : "this is a test",
    "start_offset" : 0,
    "end_offset" : 21,
    "type" : "word",
    "position" : 1
  } ]
}
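To compare several tokenizers side by side, as the question asks, you can loop over a list of built-in tokenizer names and issue one _analyze request per tokenizer. The selection below is just an illustrative subset of the built-ins; the snippet prints the curl commands to run against a live cluster:

```shell
# A hypothetical selection of built-in tokenizers to compare -- extend as needed.
COUNT=0
for tok in standard whitespace keyword letter; do
  # One _analyze request per tokenizer; run each against a cluster to see its tokens.
  CMD="curl -XGET 'localhost:9200/_analyze?tokenizer=$tok&pretty' -d 'this is a test'"
  echo "$CMD"
  COUNT=$((COUNT+1))
done
```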

Upvotes: 6
