Elasticsearch: Find duplicates by field

Question

I'm working with Elasticsearch. I have a collection of events, each with an event name, for example FC Barcelona - Real Madrit. Somewhere else in the collection the same event may appear as Footbal Club Barcela - FC Real Madryt.

I need to find groups of at least 2 matching documents without supplying any query text. I think an aggregation and an ngram tokenizer should be used here, but I'm not sure.
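
To make the n-gram idea concrete, here is a rough Python sketch, outside Elasticsearch, of how character trigrams could make the two spellings comparable. The helper names `char_ngrams` and `jaccard`, the trigram size, and the third event name are assumptions for illustration only; this is not how Elasticsearch analyzes or scores documents.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of text.

    Only letters, digits and whitespace are kept, loosely mirroring the
    token_chars I listed in my settings (an assumption, not ES behavior).
    """
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items / all items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


name_a = "FC Barcelona - Real Madrit"
name_b = "Footbal Club Barcela - FC Real Madryt"
name_c = "Chelsea - Arsenal"  # hypothetical unrelated event

# The two misspelled variants share many trigrams ("bar", "rea", "mad", ...),
# while the unrelated event shares almost none.
sim_ab = jaccard(char_ngrams(name_a), char_ngrams(name_b))
sim_ac = jaccard(char_ngrams(name_a), char_ngrams(name_c))
print(sim_ab, sim_ac)
```

What I'm after is something equivalent inside Elasticsearch: grouping documents whose names overlap this way, without typing a query string myself.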

Here are my index settings:

{
        "settings": {
            "analysis": {
                "analyzer": {
                    "test": {
                        "tokenizer": "test",
                        "filter": ["lowercase", "word_delimiter", "nGram", "porter_stem"]
                    }
                },
                "tokenizer": {
                    "test": {
                        "type": "ngram",
                        "min_gram": 3,
                        "max_gram": 15,
                        "token_chars": [
                            "letter",
                            "digit",
                            "whitespace"
                        ]
                    }
                }
            }
        }
    }

This is what my current query looks like:

{
  "size": 0,
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "eventName",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}
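
For reference, a rough Python sketch of what this aggregation does as I understand it: a `terms` aggregation on a `keyword` field buckets documents by exact value, so only identical names end up together. The sample documents and the `duplicate_buckets` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical sample documents; the first two are exact duplicates,
# the third is the same event spelled differently.
docs = [
    {"eventName": "FC Barcelona - Real Madrit"},
    {"eventName": "FC Barcelona - Real Madrit"},
    {"eventName": "Footbal Club Barcela - FC Real Madryt"},
]


def duplicate_buckets(documents, field, min_doc_count=2):
    """Count exact field values and keep those seen at least min_doc_count times."""
    counts = Counter(doc[field] for doc in documents)
    return {value: count for value, count in counts.items() if count >= min_doc_count}


buckets = duplicate_buckets(docs, "eventName")
print(buckets)
```

The differently spelled name never joins the bucket, which is exactly the problem I'm trying to solve.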

And here is my mapping:

{
            "event": {
                "properties": {
                    "eventName": {
                        "type": "keyword"
                        // fielddata: true
                    }
                }
            }
        }

Could you point me in the right direction, please?
