Linux_cat

Reputation: 55

How to keep punctuation in Elasticsearch's Thai tokenizer

I'm working with Elasticsearch 7.17.1 to analyze Thai text. My goal is to tokenize Thai text while also retaining punctuation as separate tokens. However, I've encountered a challenge: the default behavior of most Elasticsearch analyzers, including the Thai tokenizer, is to discard punctuation, and I haven't found a way to configure them to do otherwise.

I attempted to create a custom analyzer in hopes of achieving this, but so far, I've had no success. Below is my latest attempt:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "thai_with_punctuation": {
          "tokenizer": "thai",
          "filter": ["punctuation_filter"]
        }
      },
      "filter": {
        "punctuation_filter": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([\\p{Punct}])"
          ]
        }
      }
    }
  }
}

When analyzing text with the custom analyzer:

POST /my_thai_index/_analyze
{
  "analyzer": "thai_with_punctuation",
  "text": "(เปิด) ไม่ เป็น??? ??lol."
}

The response omits the punctuation entirely:

{
    "tokens": [
        {
            "token": "เปิด",
            "start_offset": 1,
            "end_offset": 5,
            "type": "word",
            "position": 0
        },
        {
            "token": "ไม่",
            "start_offset": 7,
            "end_offset": 10,
            "type": "word",
            "position": 1
        },
        {
            "token": "เป็น",
            "start_offset": 11,
            "end_offset": 15,
            "type": "word",
            "position": 2
        },
        {
            "token": "lol",
            "start_offset": 21,
            "end_offset": 24,
            "type": "word",
            "position": 3
        }
    ]
}

Retaining punctuation is crucial for my application: another part of the system adjusts its behaviour based on the punctuation in the text.

Is there a workaround or a different approach to achieve this without creating a custom Elasticsearch plugin?

Upvotes: 1

Views: 89

Answers (1)

Alexandros Patestos

Reputation: 1

As far as I know, you cannot change the tokenizer of the thai analyzer. Note also that token filters run only after the tokenizer, and the thai tokenizer has already discarded the punctuation by that point, so your pattern_capture filter never sees it. One workaround for your requirement would be to retrieve the highlights from the search response and check for punctuation there.

So, you would retrieve the highlights of all search hits matching the tokens you are searching for, keep the hits whose highlights contain punctuation (or whatever your requirement is), and run your validations on those.
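A minimal sketch of that workaround, assuming the Thai text is indexed in a field named content (both the field name and the query term here are placeholders, not taken from the question):

POST /my_thai_index/_search
{
  "query": {
    "match": { "content": "เปิด" }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

Each hit in the response then carries a highlight.content array of fragments built from the original field value, so the punctuation that the analyzer dropped is still visible there, and the client can test each fragment with a pattern such as \p{Punct}.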

Upvotes: 0
