Reputation: 84

How to find similar tags from text using Elasticsearch

I am trying to use Elasticsearch to find the tags most similar to a given text.

For example, I created test_index and inserted two documents:

POST test_index/_doc/17
{
  "id": 17,
  "tags": ["it", "devops", "server"]
}

POST test_index/_doc/20
{
  "id": 20,
  "tags": ["software", "hardware"]
}

So I expect to find the "software" tag (its text or id) from the text "I'm using some softwares and applications".

I was hoping someone could provide an example of how to do this, or at least point me in the right direction.

Thanks.

Upvotes: 0

Answers (2)

ali reza

Reputation: 84

If the text you search with contains the base or root form of a word, stemming is a good approach.

If you need to find the word(s) most similar to the text, n-grams are more suitable.

If you are matching exact words from the text against the words in the tags, shingles are the better choice.
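
To make the difference between the last two options more concrete, here is a minimal plain-Python sketch of the tokens that character n-grams and word shingles produce. It is only an illustration, not Elasticsearch code: the function names char_ngrams and word_shingles are made up for this example, and Elasticsearch's own ngram and shingle token filters have further options that are not modelled here.

```python
def char_ngrams(word, n):
    """Return every contiguous run of n characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def word_shingles(text, size):
    """Return every contiguous run of `size` whitespace-separated words."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


# "software" and "softwares" share most of their trigrams,
# which is why an n-gram based field can treat them as similar.
shared = set(char_ngrams("software", 3)) & set(char_ngrams("softwares", 3))
print(sorted(shared))

# Shingles keep word order, so they suit matching multi-word tags.
print(word_shingles("softwares and applications", 2))
```

N-grams tolerate small spelling differences inside a word, while shingles only help when the same words appear next to each other in the same order.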

Upvotes: 0

Kamal Kunjapur

Reputation: 8840

What you are looking for is a concept called stemming. You would need to create a custom analyzer that uses the Stemmer Token Filter.

Below are the mapping, sample documents, query and response:

Mapping:

PUT my_stem_index
{
  "settings": {
      "analysis" : {
          "analyzer" : {
              "my_analyzer" : {
                  "tokenizer" : "standard",
                  "filter" : ["lowercase", "my_stemmer"]
              }
          },
          "filter" : {
              "my_stemmer" : {
                  "type" : "stemmer",
                  "name" : "english"
              }
          }
      }
  },
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "tags":{
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword":{
            "type": "keyword"
          }
        }
      }
    }
  }
}

From the comments, it appears that you are using a version older than 7. In that case you may have to include the mapping type (_doc below) in the mapping:

PUT my_stem_index
{
   "settings":{
      "analysis":{
         "analyzer":{
            "my_analyzer":{
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "my_stemmer"
               ]
            }
         },
         "filter":{
            "my_stemmer":{
               "type":"stemmer",
               "name":"english"
            }
         }
      }
   },
   "mappings":{
      "_doc":{
         "properties":{
            "id":{
               "type":"keyword"
            },
            "tags":{
               "type":"text",
               "analyzer":"my_analyzer",
               "fields":{
                  "keyword":{
                     "type":"keyword"
                  }
               }
            }
         }
      }
   }
}

Sample Documents:

POST my_stem_index/_doc/17
{
  "id": 17,
  "tags": ["it", "devops", "server"]
}

POST my_stem_index/_doc/20
{
  "id": 20,
  "tags": ["software", "hardware"]
}

POST my_stem_index/_doc/21
{
  "id": 21,
  "tags": ["softwares and applications", "hardwares and storage devices"]
}

Request Query:

POST my_stem_index/_search
{
  "query": {
    "match": {
      "tags": "software"
    }
  }
}

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.5908618,
    "hits" : [
      {
        "_index" : "my_stem_index",
        "_type" : "_doc",
        "_id" : "20",
        "_score" : 0.5908618,
        "_source" : {
          "id" : 20,
          "tags" : [
            "software",
            "hardware"
          ]
        }
      },
      {
        "_index" : "my_stem_index",
        "_type" : "_doc",
        "_id" : "21",
        "_score" : 0.35965496,
        "_source" : {
          "id" : 21,
          "tags" : [
            "softwares and applications",             <--- Note that `softwares` was also matched.
            "hardwares and storage devices"
          ]
        }
      }
    ]
  }
}

Notice in the response how both documents, those with _id 20 and 21, appear.
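
The same match query should also accept the whole sentence from the question. A match query analyzes the query text with the field's analyzer and, by default, a document matches if any one of the resulting terms matches, so with the mapping above the sentence should still reach the software tags. Here is a small Python sketch that only builds the request body; the helper name build_tag_query is hypothetical, and sending the request to the cluster is left out.

```python
import json


def build_tag_query(text, field="tags"):
    """Build a match-query body; Elasticsearch analyzes `text` with the field's analyzer."""
    return {"query": {"match": {field: text}}}


# Body for: POST my_stem_index/_search
body = build_tag_query("I'm using some softwares and applications")
print(json.dumps(body, indent=2))
```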

Additional Note:

If you are new to Elasticsearch, I'd suggest spending some time understanding the concept of Analysis and how Elasticsearch implements it using Analyzers.

This would help you understand why the document with softwares and applications is also returned when you query only for software, and vice versa.
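
As a rough illustration of that idea, the plain-Python sketch below imitates the stages of my_analyzer (split into tokens, lowercase, stem) and applies them to both the tags and the query text. It is an assumption-laden toy, not how Elasticsearch is implemented: toy_stem only strips a plural "s" and is a crude stand-in for the real english stemmer, which applies many more rules, and the names analyze and search are made up for this example.

```python
import re


def toy_stem(token):
    """Crude stand-in for the english stemmer: strip a plural "s" only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Lowercase, split into word tokens, then stem each token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {toy_stem(token) for token in tokens}


# The three sample documents from above, keyed by id.
docs = {
    17: ["it", "devops", "server"],
    20: ["software", "hardware"],
    21: ["softwares and applications", "hardwares and storage devices"],
}


def search(query):
    """Return ids of documents sharing at least one analyzed term with the query."""
    query_terms = analyze(query)
    return sorted(
        doc_id
        for doc_id, tags in docs.items()
        if any(query_terms & analyze(tag) for tag in tags)
    )


print(search("software"))   # [20, 21]
print(search("softwares"))  # [20, 21]
```

Because the same analysis runs at index time and at query time, software and softwares end up as the same term, which is why either form finds both documents.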

Hope this helps!

Upvotes: 2