Reputation: 21
I just started learning Elasticsearch. My data has the company name and its website, and I have a list that contains all the domain aliases of a company. I am trying to write a query that boosts the records whose website is in the list.
My data looks like:
{"company_name": "Kaiser Permanente",
"website": "http://www.kaiserpermanente.org"},
{"company_name": "Kaiser Permanente - Urgent Care",
"website": "http://kp.org"}.
The list of domain aliases is:
["kaiserpermanente.org","kp.org","kpcomedicare.org", "kp.com"]
The actual list is longer than the above example. I've tried this query:
{
  "bool": {
    "should": {
      "terms": {
        "website": [
          "kaiserpermanente.org",
          "kp.org",
          "kpcomedicare.org",
          "kp.com"
        ],
        "boost": 20
      }
    }
  }
}
The query doesn't return anything because the "terms" query requires an exact match. The domains in the list and the URLs in my data are similar but not identical.
What I expect is that the query returns both records in my example. I think "match" could work, but I couldn't figure out how to match a value against any similar value in the list.
I found a similar question, "How to do multiple "match" or "match_phrase" values in ElasticSearch". The solution works, but my alias list contains more than 50 elements. It would be very verbose to write a separate "match_phrase" for each element. Is there a more efficient way, like "terms", so that I could just pass in a list?
I'd appreciate it if anyone could help me out with this, thanks!
Upvotes: 1
Views: 1377
Reputation: 1166
What you are observing is the difference between `terms` and `match`, which has been covered in many Stack Overflow posts and in the ES docs. When you store that info, I assume you are using the standard analyzer. This means that when you push "http://kp.org", Elasticsearch indexes the tokens [ "http", "kp", "org" ] rather than the whole string (you can check the tokens actually produced for your mapping with the `_analyze` API). `terms` does not analyze its input, so it looks for the exact token "kp.org", and there is no such token to match because the value was broken down by the analyzer when indexing. `match`, however, analyzes what you query for, so "kp.org" => [ "kp", "org" ], and it is able to find one or both tokens. Phrase matching additionally requires the tokens to be next to each other, which is probably necessary for what you need.
Unfortunately, there does not appear to be an option that works like `match` but accepts many values to match against like `terms`. I believe you have three options:
1. Programmatically generate the query as described in the Stack Overflow post that you referenced. You noted that this would be verbose, but I think it is fine unless you have something like 1k aliases.
2. Analyze the `website` field so that analysis transforms "http://www.kaiserpermanente.org" => "kaiserpermanente.org" and "http://kp.org" => "kp.org" for indexing. With this index-time analysis approach, you can successfully use the `terms` filter when querying. This might be fine given that URLs are structured and the use cases you outline only appear to be concerned with domains. If you do this, use multi-fields to analyze one website value in multiple ways. It's nice to have Elasticsearch do this kind of work for you so that you don't have to worry about it in your own code.
3. Do this processing beforehand (before pushing data to ES) so that when you store data in Elasticsearch, you store not only the website field but also a domain, paths, and whatever else you need, all calculated beforehand. You get control at the cost of the effort you have to put in.
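To make the three options more concrete, here is a rough Python sketch that only builds the request bodies as plain dicts, so it runs without a cluster. The names `build_phrase_query`, `build_terms_query`, `extract_domain`, `strip_url`, `domain_analyzer`, and the `domain` / `website.domain` fields are all made up for illustration, and the index body assumes the typeless mapping format of Elasticsearch 7 or later. Treat it as a starting point under those assumptions, not a tested setup.

```python
from urllib.parse import urlparse

ALIASES = ["kaiserpermanente.org", "kp.org", "kpcomedicare.org", "kp.com"]


def build_phrase_query(aliases, field="website", boost=20):
    """Option 1: generate one match_phrase clause per alias."""
    return {
        "bool": {
            "should": [
                {"match_phrase": {field: {"query": alias, "boost": boost}}}
                for alias in aliases
            ]
        }
    }


# Option 2 (hypothetical names): a custom analyzer that strips the scheme,
# an optional "www." and any path, then keeps the rest as a single token.
# It is attached as a multi-field, so "website" itself stays as it was.
DOMAIN_INDEX_BODY = {
    "settings": {"analysis": {
        "char_filter": {"strip_url": {
            "type": "pattern_replace",
            "pattern": r"^https?://(www\.)?([^/]+).*$",
            "replacement": "$2",
        }},
        "analyzer": {"domain_analyzer": {
            "type": "custom",
            "char_filter": ["strip_url"],
            "tokenizer": "keyword",
            "filter": ["lowercase"],
        }},
    }},
    "mappings": {"properties": {"website": {
        "type": "text",
        "fields": {"domain": {"type": "text", "analyzer": "domain_analyzer"}},
    }}},
}


def extract_domain(url):
    """Option 3: compute the domain in your own code before indexing.

    Returns "" when the URL has no scheme, because urlparse then finds no host.
    """
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_terms_query(aliases, field, boost=20):
    """Options 2 and 3: a single terms clause against a domain-only field."""
    return {"bool": {"should": [{"terms": {field: aliases, "boost": boost}}]}}


phrase_query = build_phrase_query(ALIASES)                    # option 1
subfield_query = build_terms_query(ALIASES, "website.domain")  # option 2
domain_query = build_terms_query(ALIASES, "domain")            # option 3

doc = {"company_name": "Kaiser Permanente",
       "website": "http://www.kaiserpermanente.org"}
doc["domain"] = extract_domain(doc["website"])  # extra field for option 3
```

With 50 aliases, option 1 still produces a single request; the verbosity lives in the generated JSON rather than in code you maintain. Options 2 and 3 keep the query itself as short as the original `terms` attempt.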
Upvotes: 1