Reputation: 33
I'm working with Elasticsearch. I have a collection of events with event names, for example FC Barcelona - Real Madrit, and somewhere else in the collection there may be Footbal Club Barcela - FC Real Madryt.
I need to find names that occur at least twice, without supplying any query text. I think an aggregation and the ngram tokenizer should be used here, but I'm not sure.
Here are my index settings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "test": {
          "tokenizer": "test",
          "filter": ["lowercase", "word_delimiter", "nGram", "porter_stem"]
        }
      },
      "tokenizer": {
        "test": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 15,
          "token_chars": [
            "letter",
            "digit",
            "whitespace"
          ]
        }
      }
    }
  }
}
And this is what my current query looks like:
{
  "size": 0,
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "eventName",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}
And here is my mapping:
{
  "event": {
    "properties": {
      "eventName": {
        "type": "keyword"
        // fielddata: true
      }
    }
  }
}
Could you point me in the right direction, please?
Upvotes: 2
Views: 5357
Reputation: 1286
You shouldn't need the nGrams if you are looking for duplicates. You'll want to use the keyword type like you have. You can use the terms aggregation like you already have.
POST <index_name>/event/_search
{
  "size": 0,
  "aggs": {
    "duplicateNames": {
      "terms": {
        "field": "eventName",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}
Each duplicated eventName will be listed as a bucket key in the duplicateNames aggregation. The document _id values will be in the duplicateDocuments top hits of each bucket.
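To make the response layout concrete, here is a rough Python sketch that walks the aggregation result and collects the _id values per duplicated name. It assumes the response has already been parsed into a dict; the function name duplicate_ids and the sample_response data are made up for illustration and are not output from a real cluster.

```python
from typing import Any, Dict, List


def duplicate_ids(
    response: Dict[str, Any],
    agg_name: str = "duplicateNames",
    hits_name: str = "duplicateDocuments",
) -> Dict[str, List[str]]:
    """Map each duplicated eventName to the _ids returned by top_hits."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    result: Dict[str, List[str]] = {}
    for bucket in buckets:
        # Each terms bucket carries the sub-aggregation under its own name.
        hits = bucket.get(hits_name, {}).get("hits", {}).get("hits", [])
        result[bucket["key"]] = [hit["_id"] for hit in hits]
    return result


# Hypothetical, hand-written response in the shape the search API returns.
sample_response = {
    "aggregations": {
        "duplicateNames": {
            "buckets": [
                {
                    "key": "FC Barcelona - Real Madrit",
                    "doc_count": 2,
                    "duplicateDocuments": {
                        "hits": {"hits": [{"_id": "1"}, {"_id": "7"}]}
                    },
                }
            ]
        }
    }
}

print(duplicate_ids(sample_response))
```

Keep in mind that top_hits returns only 3 documents per bucket by default and the terms aggregation returns only 10 buckets by default, so you may need to set "size" on both if you expect more duplicates than that.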
Upvotes: 1