Mangu Singh Rajpurohit

Reputation: 11420

Creating a custom stemmer for Hinglish in Elasticsearch

I am new to Elasticsearch. I want to create a custom analyzer in Elasticsearch, with custom filters and a custom stemmer. I know that Elasticsearch is built on Lucene, and that Lucene supports custom stemmers. However, I am not able to find any example that shows a custom analyzer/stemmer implementation in Lucene and the integration of the same into Elasticsearch.

Apologies for my bad English. Thanks in advance.

Edit 1

What I want is a Hinglish stemmer that will transform the following inputs into the outputs given below:

Upvotes: 2

Views: 1479

Answers (2)

Mangu Singh Rajpurohit

Reputation: 11420

Finally, after several hiccups, I was able to create an implementation of a Hinglish stemmer. It is available at the following link:

https://github.com/Mangu-Singh-Rajpurohit/hinglish-stemmer/

Upvotes: 3

Adonis

Reputation: 4818

I will try to write a simple answer; let me know if you have any questions.

First step: create the custom stemming file (here "custom_stems.txt") with the following contents, and place it in the config folder (I put it under "config/analysis/custom_stems.txt"):

rama => ram
raam => ram
sachin => sachin
sacheen => sachin
sachina => sachin
sacheena => sachin
kuldeep => kuldip
kooldeep => kuldip
kooldipa => kuldip
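As a sanity check, the effect of these rules can be simulated outside Elasticsearch: a `stemmer_override` filter does an exact-match lookup on each (already lowercased) token, so a plain dictionary captures the behavior. A rough Python sketch (the rule text below is the file above; the whitespace tokenization is only an approximation of the standard tokenizer):

```python
# Simulate a stemmer_override filter: exact-match token replacement
# based on "source => target" rules like those in custom_stems.txt.

RULES_TEXT = """\
rama => ram
raam => ram
sachin => sachin
sacheen => sachin
sachina => sachin
sacheena => sachin
kuldeep => kuldip
kooldeep => kuldip
kooldipa => kuldip
"""

def parse_rules(text):
    """Parse 'source => target' lines into a lookup dict."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=>" not in line:
            continue
        source, target = (part.strip() for part in line.split("=>", 1))
        rules[source] = target
    return rules

def analyze(text, rules):
    # Approximate the analyzer chain: tokenize on whitespace, lowercase,
    # then apply the override (unknown tokens pass through unchanged).
    tokens = [tok.lower() for tok in text.split()]
    return [rules.get(tok, tok) for tok in tokens]

rules = parse_rules(RULES_TEXT)
print(analyze("Rama kooldeep unknown", rules))  # ['ram', 'kuldip', 'unknown']
```

Note that unlike an algorithmic stemmer, this is a pure dictionary: any token not listed in the file is left untouched.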

Then create an index with an adequate mapping (I use the mapping from this example; you can create other analyzers, the only important part here being the "custom_stems" stemmer):

PUT /my_index
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "custom_stems"]
                }
            },
            "filter" : {
                "custom_stems" : {
                    "type" : "stemmer_override",
                    "rules_path" : "analysis/custom_stems.txt"
                }
            }
        }
    }
}

Test that it works:

GET /my_index/_analyze
{
  "text": ["Rama"],
  "analyzer": "my_analyzer"
}

You should see in the output:

{
  "tokens": [
    {
      "token": "ram",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
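One detail worth noting in that output: `start_offset` and `end_offset` span the original surface form "Rama" (4 characters), not the stemmed token "ram", because offsets always point back into the source text (this is what highlighting relies on). A quick sketch of that invariant, using the values from the response above:

```python
# Values taken from the _analyze response above.
text = "Rama"          # original input text
token = "ram"          # stemmed token emitted by the filter
start_offset, end_offset = 0, 4

# Offsets index the original text, so slicing recovers the surface form...
assert text[start_offset:end_offset] == "Rama"

# ...which may differ in length from the emitted (stemmed) token.
assert len(token) != end_offset - start_offset
print("offsets refer to the original text, not the stemmed token")
```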

Note that I used:

  • Elasticsearch 5.3.2
  • Kibana 5.0.1

Upvotes: 1
