Reputation: 39833
I'm trying to create a filter against ElasticSearch that requires more than one match before the result is returned. For example, in the following text:
If you're uneasy at the idea of riding in a vehicle that drives itself, just wait till you see Google's new car. It has no gas pedal, no brake and no steering wheel. Google has been demonstrating its driverless technology for several years by retrofitting Toyotas, Lexuses and other cars with cameras and sensors. But now, for the first time, the company has unveiled a prototype of its own: a cute little car that looks like a cross between a VW Beetle and a golf cart.
If I set the minimum number of matches to 2 and searched for Google, I would expect this result because Google appears in the text twice. However, searching on Toyota with the same number of expected matches should not return this article.
How do I construct this filter?
Upvotes: 2
Views: 205
Reputation: 27487
Probably not exactly what you are looking for, but you could add explain to your query and then filter on the client side by the number of term matches. From the docs, the query would look like this:
GET /_search?explain
{
"query" : { "match" : { "tweet" : "honeymoon" }}
}
Results would look like this:
"_explanation": {
"description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
"value": 0.076713204,
"details": [
{
"description": "fieldWeight in 0, product of:",
"value": 0.076713204,
"details": [
{
"description": "tf(freq=1.0), with freq of:",
"value": 1,
"details": [
{
"description": "termFreq=1.0",
"value": 1
}
]
},
{
"description": "idf(docFreq=1, maxDocs=1)",
"value": 0.30685282
},
{
"description": "fieldNorm(doc=0)",
"value": 0.25
}
]
}
]
}
You could then look up the term frequency in each hit's explanation (the termFreq entry among the description fields) and keep only the hits where it is at least your minimum, for example 2.
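A minimal sketch of that client-side filtering in Python, assuming each hit carries an `_explanation` tree shaped like the sample output above. The helper names `max_term_freq` and `filter_hits` are made up for illustration, and the exact wording of the description strings may differ between Elasticsearch versions, so the pattern may need adjusting:

```python
import re

# Matches the leaf description from the sample output, e.g. "termFreq=1.0".
# Assumption: your Elasticsearch version uses the same wording.
TERM_FREQ = re.compile(r"^termFreq=(\d+(?:\.\d+)?)$")


def max_term_freq(explanation):
    """Return the largest termFreq found anywhere in an _explanation tree."""
    best = 0.0
    match = TERM_FREQ.match(explanation.get("description", ""))
    if match:
        best = float(match.group(1))
    # Leaf nodes have no "details" key, so fall back to an empty list.
    for child in explanation.get("details") or []:
        best = max(best, max_term_freq(child))
    return best


def filter_hits(hits, min_matches):
    """Keep only hits whose explanation reports termFreq >= min_matches."""
    return [
        hit for hit in hits
        if max_term_freq(hit.get("_explanation", {})) >= min_matches
    ]
```

For a single-term query this keeps a hit only if the term occurs at least `min_matches` times. For a multi-term query it takes the highest frequency among the terms, which may or may not be what you want.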
I believe you may be able to do this directly (no client-side filtering) by using scripting, as you can get a reference to the term frequency:
Term statistics:
Term statistics for a field can be accessed with a subscript operator like this: _index['FIELD']['TERM']. This will never return null, even if the term or field does not exist. If you do not need the term frequency, call _index['FIELD'].get('TERM', 0) to avoid unnecessary initialization of the frequencies. The flag will only have an effect if you set the index_options to docs (see mapping documentation).
_index['FIELD']['TERM'].df()
df of term TERM in field FIELD. Will be returned, even if the term is not present in the current document.
_index['FIELD']['TERM'].ttf()
The sum of term frequencies of term TERM in field FIELD over all documents. Will be returned, even if the term is not present in the current document.
_index['FIELD']['TERM'].tf()
tf of term TERM in field FIELD. Will be 0 if the term is not present in the current document.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html
However, I've not done this myself, and the normal concerns about both security and performance apply when using server-side scripting.
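For what it's worth, here is an untested sketch of what such a request body might look like, built as a Python dict. It assumes an older Elasticsearch release where the `filtered` query, the `script` filter and the `_index` script variable are all available (newer releases may not support them) and where scripting is enabled on the cluster. The function name `min_term_frequency_query` and the field name `body` are made up for illustration:

```python
import json


def min_term_frequency_query(field, term, min_tf):
    """Build a request body that keeps documents where `term` occurs
    at least `min_tf` times in `field`, using _index[...][...].tf()."""
    # Assumption: `term` is the indexed token, so with the standard
    # analyzer it should already be lowercased ("google", not "Google").
    script = "_index['%s']['%s'].tf() >= min_tf" % (field, term)
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "script": {
                        "script": script,
                        "params": {"min_tf": min_tf},
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body; you would POST this to /_search yourself.
    print(json.dumps(min_term_frequency_query("body", "google", 2), indent=2))
```

Passing the threshold through `params` rather than formatting it into the script string means the same script text is reused for different thresholds.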
Upvotes: 1