ChrisS

Reputation: 68

Elasticsearch - Looking for a performant way of must_not with ids

I have the following situation:

We have a product search that is currently realized with a commercial solution. I'm experimenting with Elasticsearch to reimplement our product search, and basically it works very well. But we have one speciality: our product catalog contains about 1 million products, and not every customer is allowed to buy every product. There are many rules defining whether a customer may buy a particular product.

It's not just:

Customer A is not allowed to buy products of vendor A

Or:

Customer B is not allowed to buy products of category B of vendor B.

That would be easy.

To obtain the products a customer is not allowed to buy, we implemented a microservice/webservice years ago. This webservice returns a product blacklist: just a list of product numbers.

The problem: if I simply run an Elasticsearch query that ignores these blacklisted products, I get back products the customer is not allowed to buy. If I only query the top 10 search hits, it can happen that I must not show some of them, because the customer is not allowed to buy them. Likewise, if I use aggregations for vendors and categories, I get back vendors and/or categories the customer is probably not allowed to buy from.

What did I do in my prototype?

Before querying Elasticsearch I request the product blacklist for the given customer (and cache it, of course). Once I have received the blacklist, I run a query like this:

{
  "query" : {
    "bool" : {
      "must_not" : [
        {
          "ids" : {
            "values" : [

              // Numbers of blacklisted products. Can be thousands!

              1234567,
              1234568,
              1234569,
              1234570,
              ...
            ]
          }
        }
      ],
      "should" : [
        {
          "query" : {
            ...
          }
        }
      ]
    }
  },
  "aggregations" : {
    ...
  }
}
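On the client side, assembling this request body is straightforward. Here is a minimal Python sketch; `get_blacklist` is a hypothetical stand-in for the blacklist webservice plus cache, and the `multi_match` clause is just a placeholder for the real search query:

```python
def get_blacklist(customer_id):
    # Hypothetical stand-in for the cached blacklist webservice call.
    return [1234567, 1234568, 1234569, 1234570]

def build_search_body(customer_id, keyword):
    # Build the bool query: exclude blacklisted ids, match the keyword.
    return {
        "query": {
            "bool": {
                "must_not": [
                    {"ids": {"values": get_blacklist(customer_id)}}
                ],
                "should": [
                    {"multi_match": {"query": keyword, "fields": ["_all"]}}
                ],
            }
        }
    }

body = build_search_body(42, "screwdriver")
```

The drawback discussed below applies unchanged: the `values` list travels over the wire with every request.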

This works very well, but we have customers with thousands of blacklisted products. So on the one hand I'm afraid the network traffic will be too high, and on the other hand I noticed that the complete Elasticsearch request is remarkably slower. But it basically depends on the number of blacklisted products.

My next approach was to develop my own Elasticsearch query builder as a plugin, which handles the blacklist inside Elasticsearch. This blacklist query extends AbstractQueryBuilder and uses a TermInSetQuery. The query builder requests the blacklist of the given customer once, caches it, and builds a TermInSetQuery with all the blacklisted product numbers.

Now my query looks like this:

{
  "query" : {
    "bool" : {
      "must_not" : [
        {
          "blacklist" : {         // <-- This is my own query builder
            "customer" : 1234567
          }
        }
      ],
      "should" : [
        {
          "query" : {
            ...
          }
        }
      ]
    }
  },
  "aggregations" : {
    ...
  }
}

This solution is faster and doesn't have to send the whole list of blacklisted product numbers with every query, so I don't have the network overhead. But the query is still remarkably slower than without the blacklist. I profiled the query, and I'm not surprised to see that my blacklist query takes about 80-90% of the runtime.

I think this TermInSetQuery performs very badly in my case, because I guess the matching process in Elasticsearch, or rather Lucene, involves quite a bit more than just a:

if (blacklistSet.contains(id)) {
  continue; // ignore the current search hit.
}

Does anyone have a hint on how to implement such a blacklist mechanism in a more performant way?

Is there a way to intercept the Elasticsearch/Lucene query process? Maybe I can write my own real Lucene query instead of using the TermInSetQuery.

Thanks in advance.

Christian

Upvotes: 2

Views: 927

Answers (2)

ChrisS

Reputation: 68

Thanks for the tips. Actually, I wanted to avoid indexing the blacklist information; that's why I decided to write my own Elasticsearch blacklist plugin. But the more I think about it, the less I like that idea. If I could get rid of the plugin, I wouldn't have to maintain it, and it would be easier to move to the cloud, for example. So I tried a few things.

Test scenario:

I created a test index with 100,000 documents that include the information about which customer is not allowed to buy which product, e.g.:

{
  "id" : "123456",
  "description" : "My example products",
  ...
  "blacklist" : [ <lots_of_customer_numbers> ]
}

Furthermore, I created a blacklist index with a single document holding a blacklist of 10,000 items to test the terms lookup. (It should represent the blacklist of one customer.)
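For reference, that blacklist document would look roughly like this; the index, type, id, and path match the terms-lookup parameters in test 3 below, while the item values are made up:

```
PUT /blacklists/blacklist/1234567
{
  "items" : [ 1234567, 1234568, ... ]
}
```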

I used an existing Elasticsearch installation of version 5.1.2.

Test 1:

Blacklist ignored. Just a query for a keyword.

  "query" : {
    "bool" : {
      "must" : [
        {
          "multi_match" : {
            "query" : <keyword>,
            "fields" : [
              "_all"
            ]
          }
        }
      ]
    }
  }

Test 2:

Blacklist taken into account with must_not and ids, plus keyword. (Note: server and client run on the same host, so the network is not a bottleneck.)

  "query" : {
    "bool" : {
      "must" : [
        {
          "multi_match" : {
            "query" : <keyword>,
            "fields" : [
              "_all"
            ]
          }
        }
      ],
      "must_not" : [
        {
          "ids" : {
            "values" : [ <10000_ids> ]
          }
        }
      ]
    }
  }

Test 3:

Blacklist taken into account with a terms lookup, plus keyword.

  "query" : {
    "bool" : {
      "must" : [
        {
          "multi_match" : {
            "query" : <keyword>,
            "fields" : [
              "_all"
            ]
          }
        }
      ],
      "must_not" : [
        {
          "terms" : {
            "blacklist" : {
              "index" : "blacklists",
              "type" : "blacklist",
              "id" : "1234567",
              "path" : "items"
            }
          }
        }
      ]
    }
  }

Test 4:

Blacklist taken into account with must_not and a term query on a field within the product documents themselves, plus keyword.

  "query" : {
    "bool" : {
      "must" : [
        {
          "multi_match" : {
            "query" : <keyword>,
            "fields" : [
              "_all"
            ]
          }
        }
      ],
      "must_not" : [
        {
          "term" : {
            "blackList" : {
              "value" : "1234567"
            }
          }
        }
      ]
    }
  }

I did 1,000 searches for each test scenario. And this is the result:

Test 1: 3,708ms

Test 2: 104,775ms

Test 3: 39,586ms

Test 4: 3,586ms

As you can see, test 2 with must_not and ids performs slowest. Test 3 with the terms lookup performs about 11 times slower than test 1. Test 4 performs slightly better than test 1.

I'll check whether the test 3 scenario is sufficient for my real-world needs, because it is quite easy to realize. If not, I'll have to go with the test 4 scenario, which would be more complex in my real-life setup.
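If the test 4 route is taken, the per-product blackList field has to be kept in sync whenever a customer's blacklist changes. Here is a rough Python sketch of building the bulk update actions (elasticsearch-py bulk-helper style); the function name, the diff inputs, and the Painless snippets are untested assumptions, not part of the setup above:

```python
def blacklist_sync_actions(index, customer, added_products, removed_products):
    # Build bulk "update" actions that add or remove one customer
    # number on the affected product documents (test 4 layout:
    # products carry a "blackList" field with customer numbers).
    actions = []
    for product_id in added_products:
        actions.append({
            "_op_type": "update",
            "_index": index,
            "_id": product_id,
            "script": {
                "inline": "ctx._source.blackList.add(params.c)",
                "params": {"c": customer},
            },
        })
    for product_id in removed_products:
        actions.append({
            "_op_type": "update",
            "_index": index,
            "_id": product_id,
            "script": {
                "inline": "ctx._source.blackList.removeIf(x -> x == params.c)",
                "params": {"c": customer},
            },
        })
    return actions

acts = blacklist_sync_actions("products", "1234567", ["1", "2"], ["3"])
```

The action dicts would then be fed to something like `elasticsearch.helpers.bulk`; how the added/removed product diffs are obtained from the blacklist webservice is left open here.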

Thanks a lot, again.

Upvotes: 2

Slomo

Reputation: 1234

This is not a solution, but maybe another approach.

First of all, here is an older SO post that might interest you. As far as I know, the more recent versions of Elasticsearch haven't introduced anything better or more suitable.

If you follow the link of the answer to the Terms Query Documentation page, you will find a very simple example.

Now, instead of caching your blacklists, you could create an index and store the blacklist for each customer. You can then use the terms query, and basically reference the blacklist from the other index (=your blacklist cache).

I don't know how frequently these blacklists are updated, so that could be an issue. Also, you'd have to be careful not to get out of sync. Especially worth mentioning is that index inserts/updates are by default not immediately visible, so you might need to force a refresh when indexing/updating blacklists.
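For example, indexing a customer's blacklist document with an immediate refresh could look roughly like this (index and type names are just placeholders):

```
PUT /blacklists/blacklist/1234567?refresh=true
{
  "items" : [ <blacklisted_product_numbers> ]
}
```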

As I said, it may not be a solution. But if it sounds feasible to you, it may be worth a try to compare to your other solutions.

Upvotes: 2
