shareone2

Reputation: 33

Elasticsearch: Find duplicates by field

I'm working with Elasticsearch. I have a collection of events, each with an event name, for example FC Barcelona - Real Madrit. Somewhere else in the collection the same event may appear as Footbal Club Barcela - FC Real Madryt.

I need to find groups of at least 2 matching documents without supplying any query text. I think an aggregation and the ngram tokenizer should be used here, but I'm not sure.

Here are my index settings:

{
        "settings": {
            "analysis": {
                "analyzer": {
                    "test": {
                        "tokenizer": "test",
                        "filter": ["lowercase", "word_delimiter", "nGram", "porter_stem"]
                    }
                },
                "tokenizer": {
                    "test": {
                        "type": "ngram",
                        "min_gram": 3,
                        "max_gram": 15,
                        "token_chars": [
                            "letter",
                            "digit",
                            "whitespace"
                        ]
                    }
                }
            }
        }
    }

And this is what my current query looks like:

{
  "size": 0,
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "eventName",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

And here is my mapping:

{
            "event": {
                "properties": {
                    "eventName": {
                        "type": "keyword",
                        // fielddata: true
                    }
                }
            }
        }

Could you point me in the right direction, please?

Upvotes: 2

Views: 5357

Answers (1)

Tim

Reputation: 1286

You shouldn't need nGrams if you are looking for exact duplicates. Keep eventName mapped as the keyword type, as you already have, and use the terms aggregation you already wrote:

POST <index_name>/event/_search
{
  "size": 0,
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "eventName",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

Each duplicated eventName will be listed as a bucket key in the duplicateNames aggregation. The document _id values will be in the duplicateDocuments top hits inside each bucket.
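As a rough illustration of reading that response, here is a minimal Python sketch that pulls the document ids out of each bucket. The sample_response dict and the duplicate_ids_by_name helper are made up for this example (the ids and bucket contents are not real output); only the aggregation names duplicateNames and duplicateDocuments come from the query above.

```python
# Hypothetical response body for the aggregation query above.
# The ids and counts are illustrative, not real output.
sample_response = {
    "aggregations": {
        "duplicateNames": {
            "buckets": [
                {
                    "key": "FC Barcelona - Real Madrit",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {
                            "hits": [
                                {"_id": "1", "_source": {"eventName": "FC Barcelona - Real Madrit"}},
                                {"_id": "7", "_source": {"eventName": "FC Barcelona - Real Madrit"}},
                            ]
                        }
                    },
                }
            ]
        }
    }
}


def duplicate_ids_by_name(response):
    """Map each duplicated eventName to the _id values returned by top_hits."""
    buckets = response["aggregations"]["duplicateNames"]["buckets"]
    return {
        bucket["key"]: [
            hit["_id"] for hit in bucket["duplicateDocuments"]["hits"]["hits"]
        ]
        for bucket in buckets
    }


duplicates = duplicate_ids_by_name(sample_response)
print(duplicates)
```

Keep in mind that the terms aggregation returns 10 buckets by default and top_hits returns 3 hits per bucket by default, so you may want to set "size" on both if you expect more duplicates than that.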

Upvotes: 1
