ElasticSearch 5.5.0: Finding relevant documents

Question

In ElasticSearch 5.5.0, I was going through the “more_like_this” query but was not able to find relevant documents. I have the data below in ElasticSearch, and the “description” field holds a large amount of non-indexed data (more than 1 million bytes). I have ten thousand documents like the one below. How can I find the sets of documents that match each other by at least 80%?

{
    "_index": "school",
    "_type": "book",
    "_id": "1",
    "_source": {
      "title": "How to drive safely",
      "description": "LOTS OF WORDS...The book is written to help readers about giving driving safety guidelines. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. LONG...."
    }
}

In the end, I am looking for a list of document IDs whose contents match by at least 80%. A possible expected result containing the matching document IDs (any format is fine):

[ [1,30, 500, 8000], [2, 40, 199], .... ]

Do I need to write a batch job that compares each document with all the others and builds the output set?

Please help.

alr · Accepted Answer

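The more_like_this query has a parameter called minimum_should_match, which can be set to 80%. However, the max_query_terms parameter also needs to be taken into account here, because the percentage applies only to the terms selected for the query (at most max_query_terms of them), not to the whole document.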

Most importantly, this only works when you index the contents of those fields.
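To make the parameters above concrete, here is a minimal sketch of a more_like_this request body built in Python. It assumes the "description" field has been indexed, reuses the index, type, and ID from the question, and picks max_query_terms=25 purely for illustration. The helper name build_mlt_query is hypothetical; the resulting dictionary would be sent as the body of a search request against the index.

```python
import json


def build_mlt_query(index, doc_type, doc_id, fields=("description",),
                    max_query_terms=25, minimum_should_match="80%"):
    """Build a more_like_this search body (hypothetical helper, ES 5.x syntax)."""
    return {
        "query": {
            "more_like_this": {
                # Fields to pull representative terms from; they must be indexed.
                "fields": list(fields),
                # Use an already indexed document as the "like" input.
                "like": [{"_index": index, "_type": doc_type, "_id": doc_id}],
                # Upper bound on how many terms are selected for the query.
                "max_query_terms": max_query_terms,
                # Share of the selected terms that a hit must contain.
                "minimum_should_match": minimum_should_match,
            }
        }
    }


body = build_mlt_query("school", "book", "1")
print(json.dumps(body, indent=2))
```

With these values, a hit has to contain 80% of the (at most 25) selected terms, which is not the same thing as 80% of the full description text.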

Also, doing this at query time sounds like a really slow operation. You might want to rethink your strategy here and cluster/group the documents at index time (something you need to do yourself, as this is a very customised thing to do), so that searching becomes fast.
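As one possible illustration of grouping documents yourself, the sketch below is not an Elasticsearch feature but a stand-in technique: it compares documents by Jaccard similarity of their word sets and merges every pair that reaches the threshold using union-find. The function names, the whitespace tokenisation, and the 0.8 threshold are all assumptions made for the example. It compares every pair, so for ten thousand very large documents it would need something cheaper (for example, a hashing-based approximation) in front of it.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_similar(docs, threshold=0.8):
    """Group document ids whose word sets are at least `threshold` similar.

    docs maps a document id to its text. Returns a list of sorted id lists,
    leaving out documents that matched nothing.
    """
    # Assumption: naive lowercase + whitespace tokenisation.
    tokens = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pairwise comparison: O(n^2) pairs, acceptable only for a small sample.
    for a, b in combinations(docs, 2):
        if jaccard(tokens[a], tokens[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return [sorted(g) for g in groups.values() if len(g) > 1]


sample = {1: "a b c d e", 2: "a b c d e", 3: "x y z", 4: "a b c d f"}
result = group_similar(sample)
print(result)
```

Because groups are merged transitively, two documents can end up in the same group through a chain of matches without being 80% similar to each other directly. The group ID computed this way could be stored on each document at index time, so that finding a document's group becomes a plain term lookup.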
