Sudet

Reputation: 98

more_like_this query always returns no hits

I'm new to ES and trying to figure out how to use it to detect near-duplicates of my documents.

For that I was thinking about using the more_like_this query. The problem, however, is that it always returns 0 hits, no matter what I put in the like parameter.

Here is the code to create the index and run the query:

PUT /books
{
  "mappings": {
    "dynamic": false,  
    "properties": {  
      "name": { 
        "type": "text",
        "analyzer": "standard"
      },
      "author": { 
        "type": "text",
        "analyzer": "standard"
      },
      "release_date": { "type": "date", "format": "yyyy-MM-dd" },
      "page_count": { "type": "integer" }
    }
  }
}

POST /_bulk
{ "index" : { "_index" : "books" } }
{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585}
{ "index" : { "_index" : "books" } }
{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328}
{ "index" : { "_index" : "books" } }
{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227}
{ "index" : { "_index" : "books" } }
{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268}
{ "index" : { "_index" : "books" } }
{"name": "The Handmaids Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311}

GET books/_search
{
  "query": {
    "more_like_this": {
      "fields": ["author"],
      "like": "Ray",
      "max_query_terms": 10,
      "min_term_freq": 1
    }
  }
}

The response looks as follows:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

Upvotes: 0

Views: 20

Answers (1)

Musab Dogan

Reputation: 3680

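It's because min_doc_freq is set to 5 by default: a term from the like text is ignored unless it appears in at least 5 documents, and in your index "ray" appears in only one. Try with min_doc_freq: 0, or run the _bulk command 6 times, and then you will see the results.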

From the documentation for min_doc_freq:

The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
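
As a minimal sketch, assuming the books index and the five documents from the question, this is the same query with min_doc_freq lowered. The value 1 is chosen here only for illustration; 0, as suggested above, should work as well.

```
GET books/_search
{
  "query": {
    "more_like_this": {
      "fields": ["author"],
      "like": "Ray",
      "max_query_terms": 10,
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```

With this setting the term "ray" is no longer filtered out, so the query should match the "Ray Bradbury" document. Keep in mind that on a small test index almost every term falls below the default threshold, so the default of 5 may be perfectly reasonable again once the index holds real data.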

Upvotes: 0
