Elasticsearch tokenization for international languages

Question

I wanted to find out how Elasticsearch tokenizes languages other than English, so I tried the analyze API it provides. But I cannot understand the output at all. Take for example:

GET myindex/_analyze?analyzer=hindi&text="में कहता हूँ और तुम सुनना "

The text above contains 6 words, so I expected at most 6 tokens (assuming the text contains no stop words), but the output looks like this:

 {
   "tokens": [
      {
         "token": "2350",
         "start_offset": 3,
         "end_offset": 7,
         "type": "<NUM>",
         "position": 1
      },
      {
         "token": "2375",
         "start_offset": 10,
         "end_offset": 14,
         "type": "<NUM>",
         "position": 2
      },
      {
         "token": "2306",
         "start_offset": 17,
         "end_offset": 21,
         "type": "<NUM>",
         "position": 3
      },
      {
         "token": "2325",
         "start_offset": 25,
         "end_offset": 29,
         "type": "<NUM>",
         "position": 4
      },
      {
         "token": "2361",
         "start_offset": 32,
         "end_offset": 36,
         "type": "<NUM>",
         "position": 5
      },
      {
         "token": "2340",
         "start_offset": 39,
         "end_offset": 43,
         "type": "<NUM>",
         "position": 6
      },
      {
         "token": "2366",
         "start_offset": 46,
         "end_offset": 50,
         "type": "<NUM>",
         "position": 7
      },
      {
         "token": "2361",
         "start_offset": 54,
         "end_offset": 58,
         "type": "<NUM>",
         "position": 8
      },
      {
         "token": "2370",
         "start_offset": 61,
         "end_offset": 65,
         "type": "<NUM>",
         "position": 9
      },
      {
         "token": "2305",
         "start_offset": 68,
         "end_offset": 72,
         "type": "<NUM>",
         "position": 10
      },
      {
         "token": "2324",
         "start_offset": 76,
         "end_offset": 80,
         "type": "<NUM>",
         "position": 11
      },
      {
         "token": "2352",
         "start_offset": 83,
         "end_offset": 87,
         "type": "<NUM>",
         "position": 12
      },
      {
         "token": "2340",
         "start_offset": 91,
         "end_offset": 95,
         "type": "<NUM>",
         "position": 13
      },
      {
         "token": "2369",
         "start_offset": 98,
         "end_offset": 102,
         "type": "<NUM>",
         "position": 14
      },
      {
         "token": "2350",
         "start_offset": 105,
         "end_offset": 109,
         "type": "<NUM>",
         "position": 15
      },
      {
         "token": "2360",
         "start_offset": 113,
         "end_offset": 117,
         "type": "<NUM>",
         "position": 16
      },
      {
         "token": "2369",
         "start_offset": 120,
         "end_offset": 124,
         "type": "<NUM>",
         "position": 17
      },
      {
         "token": "2344",
         "start_offset": 127,
         "end_offset": 131,
         "type": "<NUM>",
         "position": 18
      },
      {
         "token": "2344",
         "start_offset": 134,
         "end_offset": 138,
         "type": "<NUM>",
         "position": 19
      },
      {
         "token": "2366",
         "start_offset": 141,
         "end_offset": 145,
         "type": "<NUM>",
         "position": 20
      }
   ]
}

That means that instead of six tokens, Elasticsearch has detected 20, all of type NUM (I don't know what that is). I am really confused about why this is happening. Can someone explain what is going on? What am I doing wrong, or what am I missing in my understanding?

Olly Cruickshank · Accepted Answer

How are you calling the Elasticsearch API? Possibly the Hindi characters are getting mangled by your client.

It works fine for me on Linux with curl (at least the Hindi characters appear in the result):

curl -XPOST 'http://localhost:9200/myindex/_analyze?analyzer=hindi&pretty' -d 'में कहता हूँ और तुम सुनना '
{
  "tokens" : [ {
    "token" : "कह",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "",
    "position" : 2
  }, {
    "token" : "हुं",
    "start_offset" : 9,
    "end_offset" : 12,
    "type" : "",
    "position" : 3
  }, {
    "token" : "तुम",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "",
    "position" : 5
  }, {
    "token" : "सुन",
    "start_offset" : 20,
    "end_offset" : 25,
    "type" : "",
    "position" : 6
  } ]
}
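
The numeric tokens in the question look like decimal Unicode code points of the Devanagari characters (2350 is U+092E, म). That would fit a client that sent the text as HTML numeric character references such as &#2350; rather than as raw UTF-8, although the question does not say which client was used. The Python sketch below checks that reading. It is not the Elasticsearch analyzer: the helper name to_numeric_refs is made up for illustration, and the digit-run regex is only a crude stand-in for how a tokenizer would split the escaped string.

```python
import re

# Token values reported in the question, grouped by the six words of the input.
words_as_code_points = [
    [2350, 2375, 2306],
    [2325, 2361, 2340, 2366],
    [2361, 2370, 2305],
    [2324, 2352],
    [2340, 2369, 2350],
    [2360, 2369, 2344, 2344, 2366],
]

# Step 1: read every number as a decimal Unicode code point.
words = ["".join(chr(n) for n in word) for word in words_as_code_points]
print(" ".join(words))

# Step 2: rebuild the text as the question sent it (quoted, with a trailing space).
text = '"' + " ".join(words) + ' "'


def to_numeric_refs(s):
    """Hypothetical helper: replace each non-ASCII character with &#NNNN;."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in s)


escaped = to_numeric_refs(text)

# Step 3: collect runs of digits with their offsets in the escaped string.
digit_runs = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", escaped)]
print(len(digit_runs))
print(digit_runs[:3])  # [('2350', 3, 7), ('2375', 10, 14), ('2306', 17, 21)]
```

The 20 digit runs and their offsets (3 to 7, 10 to 14, and so on up to 141 to 145) line up with the tokens in the question's output, which suggests the text was escaped before it reached Elasticsearch. Sending the raw UTF-8 text in the request body, as in the curl call above, avoids that.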
