I'm new to Elasticsearch and trying to figure out how to use it to detect near-duplicates of my documents. For that I was thinking about using the more_like_this query. The problem is that it always returns 0 hits, no matter what I put in the like parameter.
Here is the code to create the index and run the query:
PUT /books
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard"
      },
      "author": {
        "type": "text",
        "analyzer": "standard"
      },
      "release_date": { "type": "date", "format": "yyyy-MM-dd" },
      "page_count": { "type": "integer" }
    }
  }
}
POST /_bulk
{ "index" : { "_index" : "books" } }
{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585}
{ "index" : { "_index" : "books" } }
{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328}
{ "index" : { "_index" : "books" } }
{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227}
{ "index" : { "_index" : "books" } }
{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268}
{ "index" : { "_index" : "books" } }
{"name": "The Handmaids Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311}
GET books/_search
{
  "query": {
    "more_like_this": {
      "fields": ["author"],
      "like": "Ray",
      "max_query_terms": 10,
      "min_term_freq": 1
    }
  }
}
The response looks as follows:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}
Reputation: 3680
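This happens because min_doc_freq defaults to 5, so any term that appears in fewer than 5 documents is dropped from the generated query. In your index the term "ray" occurs in only one document, which leaves no terms to search for. Try setting min_doc_freq: 0, or run the _bulk command 6 times, and you will see results.
From the documentation:
The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html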
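As a minimal sketch of the first option, here is the query from the question with min_doc_freq lowered. The value 1 is my own choice (a term only needs to occur in one document to be kept); the 0 suggested above should work as well. All other parameters are unchanged from the question.

```
GET books/_search
{
  "query": {
    "more_like_this": {
      "fields": ["author"],
      "like": "Ray",
      "max_query_terms": 10,
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```

Against the five sample books, this should match the document whose author is Ray Bradbury. For real near-duplicate detection on a larger index you would probably want to raise min_doc_freq again, since very rare terms can dominate the generated query.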