ChrisS

Reputation: 68

Elasticsearch - Looking for a performant way of must_not with ids

I have the following situation:

We have a product search that is currently realized with a commercial solution. I'm experimenting with Elasticsearch to reimplement our product search, and basically it works very well. But we have one speciality: our product catalog contains about 1 million products, and not every customer is allowed to buy every product. There are many rules defining whether a customer may buy a particular product.

It's not just:

Customer A is not allowed to buy products of vendor A

Or:

Customer B is not allowed to buy products of category B of vendor B.

That would be easy.

To obtain the products a customer is not allowed to buy, we implemented a microservice/webservice years ago. This webservice returns a product blacklist: just a list of product numbers.

The problem: if I simply run an Elasticsearch query that ignores these blacklisted products, I get back products the customer is not allowed to buy. If I only query the top 10 search hits, it can happen that I must not show some of them, because the customer is not allowed to buy them. Likewise, if I use aggregations for vendors and categories, I get back vendors and/or categories the customer is probably not allowed to buy from.

What did I do in my prototype?

Before querying Elasticsearch I request the product blacklist for the given customer (and cache it, of course). Once I have received the blacklist, I run a query like this:

{
  "query" : {
    "bool" : {
      "must_not" : [
        {
          "ids" : {
            "values" : [

              // Numbers of blacklisted products. Can be thousands!

              1234567,
              1234568,
              1234569,
              1234570,
              ...
            ]
          }
        }
      ],
      "should" : [
        {
          "query" : {
            ...
          }
        }
      ]
    }
  },
  "aggregations" : {
    ...
  }
}
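On the client side, assembling this request body is straightforward. Here is a minimal Python sketch; `get_blacklist` is a hypothetical stand-in for the blacklist webservice plus cache, and the `multi_match` clause is just a placeholder for the real search query:

```python
def get_blacklist(customer_id):
    # Hypothetical stand-in for the cached blacklist webservice call.
    return [1234567, 1234568, 1234569, 1234570]

def build_search_body(customer_id, keyword):
    # Build the bool query: exclude blacklisted ids, match the keyword.
    return {
        "query": {
            "bool": {
                "must_not": [
                    {"ids": {"values": get_blacklist(customer_id)}}
                ],
                "should": [
                    {"multi_match": {"query": keyword, "fields": ["_all"]}}
                ],
            }
        }
    }

body = build_search_body(42, "screwdriver")
```

The drawback discussed below applies unchanged: the `values` list travels over the wire with every request.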

This works very well, but we have customers with thousands of blacklisted products. So on the one hand I'm afraid the network traffic will be too high, and on the other hand I noticed that the complete Elasticsearch request is remarkably slower. But it basically depends on the number of blacklisted products.

My next approach was to develop my own Elasticsearch query builder as a plugin, which handles the blacklist inside Elasticsearch. This blacklist query extends AbstractQueryBuilder and uses a TermInSetQuery. The query builder requests the blacklist of the given customer once, caches it, and builds a TermInSetQuery with all the blacklisted product numbers.

Now my query looks like this:

{
  "query" : {
    "bool" : {
      "must_not" : [
        {
          "blacklist" : {         // <-- This is my own query builder
            "customer" : 1234567
          }
        }
      ],
      "should" : [
        {
          "query" : {
            ...
          }
        }
      ]
    }
  },
  "aggregations" : {
    ...
  }
}

This solution is faster and doesn't have to send the whole list of blacklisted product numbers with every query, so I don't have the network overhead. But the query is still remarkably slower than without the blacklist. I profiled the query, and I'm not surprised to see that my blacklist query takes about 80-90% of the runtime.

I think this TermInSetQuery performs very badly in my case, because I guess the matching process in Elasticsearch, or rather Lucene, involves quite a bit more than just a:

if (blacklistSet.contains(id)) {
  continue; // ignore the current search hit.
}

Does anyone have a hint on how to implement such a blacklist mechanism in a more performant way?

Is there a way to intercept the Elasticsearch/Lucene query process? Maybe I can write my own real Lucene query instead of using the TermInSetQuery.

Thanks in advance.

Christian

Upvotes: 2

Views: 927

Answers (2)

ChrisS

Reputation: 68

Thanks for the tips. Actually, I wanted to avoid indexing the blacklist information; that's why I decided to write my own Elasticsearch blacklist plugin. But the more I think about it, the less I like that idea. If I could get rid of the plugin, I wouldn't have to maintain it, and it would be easier to move to the cloud, for example. So I tried a few things.

Test scenario:

I created a test index with 100,000 documents that include the information about which customer is not allowed to buy which product, e.g.:

{
  "id" : "123456",
  "description" : "My example products",
  ...
  "blacklist" : [ <lots_of_customer_numbers> ]
}

Furthermore, I created a blacklist index with a single document holding a blacklist of 10,000 items to test the terms lookup. (It should represent the blacklist of one customer.)
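For reference, that blacklist document would look roughly like this; the index, type, id, and path match the terms-lookup parameters in test 3 below, while the item values are made up:

```
PUT /blacklists/blacklist/1234567
{
  "items" : [ 1234567, 1234568, ... ]
}
```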

I used an existing Elasticsearch installation of version 5.1.2.

Test 1:

Blacklist ignored. Just a query for a keyword.

  "query" : {
    "bool" : {
      "must" : [
        {
          "multi_match" : {
            "query" : <keyword>,
            "fields" : [
              "_all"
            ]
          }
        }
      ]
    }
  }

Test 2:

Blacklist taken into account with must_not and ids, plus keyword. (Note: server and client run on the same host, so the network is not a bottleneck.)

  "query" : {
    "bool" : {
      "must" : [
        {
          "multi_match" : {
            "query" : <keyword>,
            "fields" : [
              "_all"
            ]
          }
        }
      ],
      "must_not" : [
        {
          "ids" : {
            "values" : [ <10000_ids> ]
          }
        }
      ]
    }
  }

Test 3:

Blacklist taken into account with a terms lookup, plus keyword.

  "query" : {
    "bool" : {
      "must" : [
        {
          "multi_match" : {
            "query" : <keyword>,
            "fields" : [
              "_all"
            ]
          }
        }
      ],
      "must_not" : [
        {
          "terms" : {
            "blacklist" : {
              "index" : "blacklists",
              "type" : "blacklist",
              "id" : "1234567",
              "path" : "items"
            }
          }
        }
      ]
    }
  }

Test 4:

Blacklist taken into account with must_not and a term query on a field within the product documents themselves, plus keyword.

  "query" : {
    "bool" : {
      "must" : [
        {
          "multi_match" : {
            "query" : <keyword>,
            "fields" : [
              "_all"
            ]
          }
        }
      ],
      "must_not" : [
        {
          "term" : {
            "blackList" : {
              "value" : "1234567"
            }
          }
        }
      ]
    }
  }

I did 1,000 searches for each test scenario. And this is the result:

Test 1: 3,708ms

Test 2: 104,775ms

Test 3: 39,586ms

Test 4: 3,586ms

As you can see, test 2 with must_not and ids performs slowest. Test 3 with the terms lookup performs about 11 times slower than test 1. Test 4 performs slightly better than test 1.

I'll check whether the test 3 scenario is sufficient for my real-world needs, because it is quite easy to realize. If not, I'll have to go with the test 4 scenario, which would be more complex in my real-life setup.
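If the test 4 route is taken, the per-product blackList field has to be kept in sync whenever a customer's blacklist changes. Here is a rough Python sketch of building the bulk update actions (elasticsearch-py bulk-helper style); the function name, the diff inputs, and the Painless snippets are untested assumptions, not part of the setup above:

```python
def blacklist_sync_actions(index, customer, added_products, removed_products):
    # Build bulk "update" actions that add or remove one customer
    # number on the affected product documents (test 4 layout:
    # products carry a "blackList" field with customer numbers).
    actions = []
    for product_id in added_products:
        actions.append({
            "_op_type": "update",
            "_index": index,
            "_id": product_id,
            "script": {
                "inline": "ctx._source.blackList.add(params.c)",
                "params": {"c": customer},
            },
        })
    for product_id in removed_products:
        actions.append({
            "_op_type": "update",
            "_index": index,
            "_id": product_id,
            "script": {
                "inline": "ctx._source.blackList.removeIf(x -> x == params.c)",
                "params": {"c": customer},
            },
        })
    return actions

acts = blacklist_sync_actions("products", "1234567", ["1", "2"], ["3"])
```

The action dicts would then be fed to something like `elasticsearch.helpers.bulk`; how the added/removed product diffs are obtained from the blacklist webservice is left open here.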

Thanks a lot, again.

Upvotes: 2

Slomo

Reputation: 1234

This is not a solution, but maybe another approach.

First of all, here is an older SO post that might interest you. As far as I know, the more recent versions of Elasticsearch haven't introduced anything better or more suitable.

If you follow the link of the answer to the Terms Query Documentation page, you will find a very simple example.

Now, instead of caching your blacklists, you could create an index and store the blacklist for each customer. You can then use the terms query, and basically reference the blacklist from the other index (=your blacklist cache).

I don't know how frequently these blacklists are updated, so that could be an issue. Also, you'd have to be careful not to get out of sync. Especially worth mentioning is that index inserts/updates are by default not immediately visible, so you might need to force a refresh when indexing/updating blacklists.
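For example, indexing a customer's blacklist document with an immediate refresh could look roughly like this (index and type names are just placeholders):

```
PUT /blacklists/blacklist/1234567?refresh=true
{
  "items" : [ <blacklisted_product_numbers> ]
}
```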

As I said, it may not be a solution. But if it sounds feasible to you, it may be worth a try to compare to your other solutions.

Upvotes: 2
