Jakub M.

Reputation: 41

Why is "More Like This" in ElasticSearch not respecting TF-IDF order for a single term?

I've been trying to grok the "More Like This" functionality in ElasticSearch. I've read and re-read the documentation but I'm having trouble understanding why the following behavior occurs.

Basically, I insert three documents and run a More Like This query with max_query_terms=1, expecting the term with the higher TF-IDF to be used, but that doesn't seem to be the case.

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "dog barks"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat fur"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat naps"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

Expected output:

"dog barks" document

Actual output:

"cat naps" and "cat fur" documents (Also, see note about determinism below)

Explanation for expected output:

The documentation says:

Suppose we wanted to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query. And the reason would be mostly, according to Lucene scoring formula, due to the terms with the highest tf-idf. Therefore, the terms of the input document that have the highest tf-idf are good representatives of that document, and could be used within a disjunctive query (or OR) to retrieve similar documents. The MLT query simply extracts the text from the input document, analyzes it, usually using the same analyzer at the field, then selects the top K terms with highest tf-idf to form a disjunctive query of these terms.

Since I specified max_query_terms = 1, only the term from the input document with the highest TF-IDF score should be used in the disjunctive query. In this case, the input document has two terms. They have the same term frequency in the input document, but cat appears in twice as many documents in the corpus, so it has a higher document frequency and therefore a lower inverse document frequency. So dog should have a higher TF-IDF score than cat, and I'd expect the disjunctive query to be just "message":"dog" and the returned result to be the "dog barks" document.
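To make that expectation concrete, here is a minimal sketch of the arithmetic over the whole corpus. It assumes the classic Lucene-style idf, 1 + ln((N + 1) / (df + 1)), which may not be exactly what MLT uses for term selection. The helper names classic_idf and doc_freq are made up for illustration and are not part of any ES API.

```python
import math

def classic_idf(doc_freq: int, doc_count: int) -> float:
    """Classic Lucene-style idf (assumed here): 1 + ln((N + 1) / (df + 1))."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))

def doc_freq(term: str, docs: list) -> int:
    """Number of documents that contain the term (whitespace tokenization)."""
    return sum(term in doc.split() for doc in docs)

# The three indexed documents, treated as one global corpus.
corpus = ["dog barks", "cat fur", "cat naps"]

# Terms of the "like" input, each with term frequency 1.
like_terms = {"cat": 1, "dog": 1}

# TF-IDF of each candidate term: tf in the input times idf in the corpus.
scores = {
    term: tf * classic_idf(doc_freq(term, corpus), len(corpus))
    for term, tf in like_terms.items()
}

# With max_query_terms=1 only the best-scoring term would survive.
best = max(scores, key=scores.get)
print(scores, best)
```

Under these assumptions dog scores higher than cat, which is why I expected a query on dog alone.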

I'm trying to understand what's going on here. Any help is very greatly appreciated. :)

Note about Determinism

I tried rerunning this setup a few times. When running the 4 ES commands (3 POST + MLT GET) above following a curl -XDELETE 'http://localhost:9200/samples' command, sometimes I'd get "cat naps" and "cat fur", but other times I'd get "cat naps","cat fur", and "dog barks", and a few times I'd even get just "dog barks".

Full output

Earlier I handwaved and just said what the outputs were for the GET query. Let me be more precise.

Actual output #1 (happens some of the time):

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":2,"max_score":0.6931472,"hits":
[{"_index":"samples","_type":"_doc","_id":"UHAoI3IBapDWjHWvsQ0_","_score":0.6931472,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"UXAoI3IBapDWjHWvsQ1c","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #2 (happens some of the time):

{"took":2,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":3,"max_score":0.2876821,"hits":
[{"_index":"samples","_type":"_doc","_id":"VHAtI3IBapDWjHWvvA0B","_score":0.2876821,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"U3AtI3IBapDWjHWvuw3l","_score":0.2876821,"_source":{
   "message": "dog barks"
}},{"_index":"samples","_type":"_doc","_id":"VXAtI3IBapDWjHWvvA0V","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #3 (happens most rarely of the three):

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":1,"max_score":0.9808292,"hits":
[{"_index":"samples","_type":"_doc","_id":"WXAzI3IBapDWjHWvbQ3s","_score":0.9808292,"_source":{
   "message": "dog barks"
}}]}}
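One possible reading of the scores in the three outputs above, offered as a sketch rather than a confirmed explanation: they match a BM25 idf computed from statistics of a single shard, not of the whole index. The sketch assumes the default BM25 similarity of ES 6.x with idf = ln(1 + (N - df + 0.5) / (df + 0.5)), and assumes the term-frequency factor is 1 (the term occurs once and every document has the same length). The (df, N) pairs are guesses chosen to fit the scores. They are not values reported by ES.

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25 idf (assumed Lucene 7.x form): ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# (label, observed _score, guessed df of the query term, guessed doc count),
# where df and doc count are per shard, not per index.
observed = [
    ("output #1, cat fur", 0.6931472, 1, 2),
    ("output #1, cat naps", 0.2876821, 1, 1),
    ("output #2, every hit", 0.2876821, 1, 1),
    ("output #3, dog barks", 0.9808292, 1, 3),
]

for label, score, df, n in observed:
    modelled = bm25_idf(df, n)
    print(f"{label}: observed={score} modelled={modelled:.7f}")
```

If this reading is right, output #3 is the only run in which all three documents sat on the same shard, and output #2 is a run in which each document sat alone on its own shard.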

Tried spacing out insertions and MLT more

I thought Elasticsearch might be in a weird "processing state" and need a bit of time between documents, so I gave ES some time after inserting each document and before running the GET command.

filename="testEsOutput-10-incremental.txt"
amount=10
echo "Test-10-incremental"
for i in {1..10}
do
    curl -XDELETE 'http://localhost:9200/samples';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "dog barks"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat fur"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat naps"
    }';
    sleep $amount

    curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }' >> "$filename"
    # bash's echo does not interpret \n without -e, so use printf for the separators
    printf '\n----\n----\n' >> "$filename"
done
echo "Done!"

However this did not seem to affect the non-deterministic output in any meaningful way.

Tried search_type=dfs_query_then_fetch

Following this SO post about ES nondeterminism, I tried adding the dfs_query_then_fetch option, i.e.

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/?search_type=dfs_query_then_fetch' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }'

but the results were still not deterministic and varied between the three outputs shown above.

Additional Notes

I tried looking at additional debug information via

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_validate/query?rewrite=true' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

but this sometimes output

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"message:cat"}]}

and other times

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"like:[cat, dog]"}]}

so even this output wasn't deterministic when run back to back.

Note: Tested on ElasticSearch 6.8.8, both locally and in online REPL. Also tested by using an actual document, e.g.

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/72 -d '{
   "message" : "dog cat"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : {
                "_id" : "72"
            },
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

but got the same "cat naps" and "cat fur" documents.

Upvotes: 3

Views: 572

Answers (1)

Jakub M.

Reputation: 41

Okay, after much debugging, I tried limiting the index to just one shard, i.e.

curl -XPUT --header 'Content-Type: application/json' 'http://localhost:9200/samples' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 0 
        }
    }
}';

When I did this, I got, 100% of the time, only the "dog barks" document.

It seems that even when using the search_type=dfs_query_then_fetch option (with a multi-shard index), ES still wasn't doing a perfectly accurate job. I'm not sure what other options I could use to force accurate behavior. Maybe someone else can answer with more insight.
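A toy model that would be consistent with all of the above, offered as a sketch under stated assumptions and not as a description of ES internals: suppose each shard picks its single MLT term from its own local document frequencies. The model assumes 5 shards (the shard count shown in the search responses), uniform routing of the auto-generated IDs, a classic idf for term selection, terms with a local document frequency of 0 being skipped, and ties going to the term listed first (chosen only because cat won in output #1). The names select_term and search are hypothetical.

```python
import itertools
import math
from collections import Counter

DOCS = ["dog barks", "cat fur", "cat naps"]
LIKE_TERMS = ["cat", "dog"]  # assumption: on a tie, the earlier term wins
NUM_SHARDS = 5               # matches "_shards": {"total": 5} in the responses

def classic_idf(doc_freq: int, doc_count: int) -> float:
    """Classic idf (assumed for term selection): 1 + ln((N + 1) / (df + 1))."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))

def select_term(shard_docs):
    """Pick one term for a shard (max_query_terms=1) from shard-local stats."""
    best, best_score = None, 0.0
    for term in LIKE_TERMS:
        df = sum(term in doc.split() for doc in shard_docs)
        if df == 0:
            continue  # assumption: a term unseen on this shard is dropped
        score = classic_idf(df, len(shard_docs))
        if score > best_score:  # strict comparison keeps the earlier term on ties
            best, best_score = term, score
    return best

def search(assignment):
    """assignment[i] is the shard holding DOCS[i]; returns the sorted hits."""
    hits = []
    for shard in range(NUM_SHARDS):
        shard_docs = [d for d, s in zip(DOCS, assignment) if s == shard]
        term = select_term(shard_docs)
        if term is not None:
            hits += [d for d in shard_docs if term in d.split()]
    return tuple(sorted(hits))

# Enumerate every way of placing the three documents on five shards.
outcomes = Counter(
    search(assignment)
    for assignment in itertools.product(range(NUM_SHARDS), repeat=len(DOCS))
)
for hits, count in outcomes.most_common():
    print(count, hits)
```

In this model only the placements that put all three documents on one shard return "dog barks" alone, which would fit both the single-shard result and output #3 being the rarest. It does not explain why dfs_query_then_fetch made no difference, so treat it as a hypothesis to test rather than an answer.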

Upvotes: 1
