I've been trying to grok the "More Like This" functionality in Elasticsearch. I've read and re-read the documentation, but I'm having trouble understanding why the following behavior occurs.
Basically, I insert three documents and run a "More Like This" query with max_query_terms=1, expecting that only the term with the highest TF-IDF is used, but that doesn't seem to be the case.
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
"message": "dog barks"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
"message": "cat fur"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
"message": "cat naps"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
"query": {
"more_like_this" : {
"like" : ["cat", "dog"],
"fields" : ["message"],
"minimum_should_match" : 1,
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 1
}
}
}';
Expected: the "dog barks" document.
Actual: the "cat naps" and "cat fur" documents (also, see the note about determinism below).
The documentation says:
Suppose we wanted to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query. And the reason would be mostly, according to Lucene scoring formula, due to the terms with the highest tf-idf. Therefore, the terms of the input document that have the highest tf-idf are good representatives of that document, and could be used within a disjunctive query (or OR) to retrieve similar documents. The MLT query simply extracts the text from the input document, analyzes it, usually using the same analyzer at the field, then selects the top K terms with highest tf-idf to form a disjunctive query of these terms.
Since I specified max_query_terms = 1, only the term from the input document with the highest TF-IDF score should be used in the disjunctive query. In this case, the input document has two terms. They have the same term frequency in the input document, but cat appears in twice as many documents in the corpus, so it has a higher document frequency. Therefore, dog should have a higher TF-IDF score than cat, and I'd expect the disjunctive query to be just "message":"dog" and the returned result to be the "dog barks" document.
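As a rough check of this reasoning, here is a minimal sketch of the arithmetic. It assumes the classic Lucene-style formula idf = 1 + ln((N + 1) / (df + 1)) and statistics taken over the whole three-document corpus; the helper names are hypothetical and the formula choice is an assumption, not something stated in the quoted documentation.

```python
import math

# The three indexed messages and the terms supplied in "like".
corpus = ["dog barks", "cat fur", "cat naps"]
like_terms = ["cat", "dog"]

def classic_idf(num_docs: int, doc_freq: int) -> float:
    """Assumed classic idf: 1 + ln((N + 1) / (df + 1))."""
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))

def doc_freq(term: str) -> int:
    """Number of corpus documents that contain the term."""
    return sum(term in doc.split() for doc in corpus)

# Each like-term occurs once in the input, so tf = 1 for both.
scores = {t: 1 * classic_idf(len(corpus), doc_freq(t)) for t in like_terms}
best = max(scores, key=scores.get)

print(scores)
print("top term with corpus-wide statistics:", best)
```

With corpus-wide statistics, dog comes out ahead of cat, which is the expectation described above.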
I'm trying to understand what's going on here. Any help is very greatly appreciated. :)
A note about determinism: I tried rerunning this setup a few times. When running the 4 ES commands (3 POST + MLT GET) above following a curl -XDELETE 'http://localhost:9200/samples' command, sometimes I'd get "cat naps" and "cat fur", but other times I'd get "cat naps", "cat fur", and "dog barks", and a few times I'd even get just "dog barks".
Earlier I handwaved and just said what the outputs were for the GET query. Let me be more precise.
Actual output #1 (happens some of the time):
{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":2,"max_score":0.6931472,"hits":
[{"_index":"samples","_type":"_doc","_id":"UHAoI3IBapDWjHWvsQ0_","_score":0.6931472,"_source":{
"message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"UXAoI3IBapDWjHWvsQ1c","_score":0.2876821,"_source":{
"message": "cat naps"
}}]}}
Actual output #2 (happens some of the time):
{"took":2,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":3,"max_score":0.2876821,"hits":
[{"_index":"samples","_type":"_doc","_id":"VHAtI3IBapDWjHWvvA0B","_score":0.2876821,"_source":{
"message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"U3AtI3IBapDWjHWvuw3l","_score":0.2876821,"_source":{
"message": "dog barks"
}},{"_index":"samples","_type":"_doc","_id":"VXAtI3IBapDWjHWvvA0V","_score":0.2876821,"_source":{
"message": "cat naps"
}}]}}
Actual output #3 (happens most rarely of the three):
{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":1,"max_score":0.9808292,"hits":
[{"_index":"samples","_type":"_doc","_id":"WXAzI3IBapDWjHWvbQ3s","_score":0.9808292,"_source":{
"message": "dog barks"
}}]}}
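The three distinct scores in these outputs can be cross-checked against per-shard statistics. The sketch below assumes the default BM25 similarity, with idf = ln(1 + (N - n + 0.5) / (n + 0.5)), and assumes the length-dependent factor is 1 because every document has exactly two terms; the helper name is hypothetical. Under those assumptions, the scores line up with shards holding 1, 2, and 3 documents, one of which matches, which would suggest that the statistics are computed per shard rather than across the index.

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Assumed BM25 idf: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Scores copied from the outputs above, keyed by the shard size
# (number of documents on the shard) that would explain them.
observed = {1: 0.2876821, 2: 0.6931472, 3: 0.9808292}

for shard_size, score in observed.items():
    predicted = bm25_idf(shard_size, 1)  # one matching document on the shard
    print(shard_size, score, abs(predicted - score) < 1e-6)
```

If this reading is right, output #3 (score 0.9808292) corresponds to all three documents landing on the same shard, and output #2 to each document sitting alone on its own shard.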
My first thought was that Elasticsearch might be in a weird "processing state" and need a bit of time between documents, so I gave ES some time after inserting each document and before running the GET command.
filename="testEsOutput-10-incremental.txt"
amount=10
echo "Test-10-incremental"
for i in {1..10}
do
curl -XDELETE 'http://localhost:9200/samples';
sleep $amount
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
"message": "dog barks"
}';
sleep $amount
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
"message": "cat fur"
}';
sleep $amount
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
"message": "cat naps"
}';
sleep $amount
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
"query": {
"more_like_this" : {
"like" : ["cat", "dog"],
"fields" : ["message"],
"minimum_should_match" : 1,
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 1
}
}
}' >> "$filename"
# bash's echo does not expand \n without -e, so use printf for the separators
printf '\n----\n----\n' >> "$filename"
done
echo "Done!"
However, this did not seem to affect the non-deterministic output in any meaningful way.
search_type=dfs_query_then_fetch
Following an SO post about ES nondeterminism, I tried adding the search_type=dfs_query_then_fetch option, i.e.
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/?search_type=dfs_query_then_fetch' -d '{
"query": {
"more_like_this" : {
"like" : ["cat", "dog"],
"fields" : ["message"],
"minimum_should_match" : 1,
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 1
}
}
}'
but the results were still not deterministic, and they varied between the three outputs shown above.
I tried looking at additional debug information via
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_validate/query?rewrite=true' -d '{
"query": {
"more_like_this" : {
"like" : ["cat", "dog"],
"fields" : ["message"],
"minimum_should_match" : 1,
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 1
}
}
}';
but this sometimes output
{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"message:cat"}]}
and other times
{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"like:[cat, dog]"}]}
so the output wasn't even deterministic (running it back to back).
Note: Tested on Elasticsearch 6.8.8, both locally and in an online REPL. I also tested using an actual indexed document as the input, e.g.
curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/72 -d '{
"message" : "dog cat"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
"query": {
"more_like_this" : {
"like" : {
"_id" : "72"
}
,
"fields" : ["message"],
"minimum_should_match" : 1,
"min_term_freq" : 1,
"min_doc_freq" : 1,
"max_query_terms" : 1
}
}
}';
but got the same "cat naps" and "cat fur" documents.
Okay, after much debugging, I tried limiting the index to just one shard, i.e.
curl -XPUT --header 'Content-Type: application/json' 'http://localhost:9200/samples' -d '{
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
}
}';
When I did this, I got only the "dog barks" document, 100% of the time.
It seems that even with the search_type=dfs_query_then_fetch option, a multi-shard index still didn't produce accurate, consistent results. I'm not sure what other options I could use to force accurate behavior. Maybe someone else can answer with more insight.
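One way to see why a single shard changes the outcome is a toy model of per-shard term selection. The sketch below assumes that each shard picks its top term from its own document frequencies, skips terms that no document on that shard contains, scores with the classic idf 1 + ln((N + 1) / (df + 1)), and breaks ties in favour of the earlier like term. All of these are assumptions made for illustration, not confirmed Elasticsearch behaviour, and every name in the code is hypothetical.

```python
import math

DOCS = {
    "dog barks": ["dog", "barks"],
    "cat fur": ["cat", "fur"],
    "cat naps": ["cat", "naps"],
}
LIKE_TERMS = ["cat", "dog"]  # order matters only for the assumed tie-break

def classic_idf(num_docs, doc_freq):
    """Assumed classic idf: 1 + ln((N + 1) / (df + 1))."""
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))

def pick_term(shard_docs):
    """Pick the single best like-term using only this shard's statistics."""
    best_term, best_score = None, float("-inf")
    for term in LIKE_TERMS:
        df = sum(term in DOCS[d] for d in shard_docs)
        if df == 0:
            continue  # assumption: terms unseen on the shard are skipped
        score = classic_idf(len(shard_docs), df)  # tf = 1 for every like-term
        if score > best_score:  # strict '>' keeps the earlier term on ties
            best_term, best_score = term, score
    return best_term

def search(placement):
    """Union of hits when every shard runs its own one-term query."""
    hits = set()
    for shard_docs in placement:
        term = pick_term(shard_docs)
        if term is not None:
            hits.update(d for d in shard_docs if term in DOCS[d])
    return sorted(hits)

placements = {
    "one shard": [["dog barks", "cat fur", "cat naps"]],
    "one doc per shard": [["dog barks"], ["cat fur"], ["cat naps"]],
    "dog+cat together": [["dog barks", "cat fur"], ["cat naps"]],
}
for name, placement in placements.items():
    print(name, "->", search(placement))
```

In this model, which documents come back depends only on how the three documents happen to be distributed across shards, which would be consistent with the single-shard index always returning just "dog barks".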