xbmono

Reputation: 2316

How to create a subset of documents and execute a query against the subset in Elasticsearch?

The reason for this is that we have an API that receives query parameters from the client and builds an Elasticsearch query from them. Depending on the type of user (financial adviser, investor, etc.), we have to add conditions that restrict the search. Unfortunately, we can't change the structure of the index (e.g. by adding extra columns), because the index is not managed by us and our API knows nothing about it except the column names, which are configurable.

Here is an example. A request arrives to search on 'investorDateOfBirth' and 'financialAdviserId', and because the search comes from an adviser, we add this condition programmatically:

financialAdviserId must be '123' (the id of the current user)

So the final query becomes:

{
  "bool" : {
    "must" : [
      {
        "term" : {
          "financialAdviserId" : {
            "value" : "123",
            "boost" : 1.0
          }
        }
      }
    ],
    "should" : [
      {
        "term" : {
          "investorDateOfBirth" : {
            "value" : "1987-11-12",
            "boost" : 1.0
          }
        }
      },
      {
        "term" : {
          "financialAdviserId" : {
            "value" : "123",
            "boost" : 1.0
          }
        }
      }
    ],
    "disable_coord" : false,
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}

As you can see, 'financialAdviserId' appears twice: the one under 'should' is built from the request query parameters, and the one under 'must' is added based on the current user. However, this returns the items with the specified investorDateOfBirth as well as all other items with adviser id 123 (including those that don't have the same DOB).

So imagine there are 3 records in the index:

| investorDateOfBirth | financialAdviserId | investorId |
| "1987-11-12"        | 123                | 111        |
| "1900-11-12"        | 123                | 222        |
| "1900-11-12"        | 123                | 333        |

For the above query, the result is all 3 rows, which is not what we want. The following query, however, returns only the first row, which is what we expect:

{
  "bool" : {
    "must" : [
      {
        "term" : {
          "financialAdviserId" : {
            "value" : "123",
            "boost" : 1.0
          }
        }
      }
    ],
    "should" : [
      {
        "term" : {
          "investorDateOfBirth" : {
            "value" : "1987-11-12",
            "boost" : 1.0
          }
        }
      }
    ],
    "disable_coord" : false,
    "adjust_pure_negative" : true,
    "boost" : 1.0
  }
}

How can we tackle this problem? How can we change the first query to get the same result as the second query (which returns only the first row)?

Just to let you know, we can't make financialAdviserId not searchable, because there are other entities that can search on that column. Is there a way to create a subset (in our case a subset where financialAdviserId is 123) and then execute the requested query from the client against that subset?

We are using Elasticsearch v5.5.3 with Java 8.

Upvotes: 1

Views: 960

Answers (1)

Nikolay Vasiliev

Reputation: 6066

You're almost there. To get the expected behavior you may nest one bool into the other:

{
  "bool": {
    "must": [
      {
        "term": {
          "financialAdviserId": {
            "value": "123"
          }
        }
      },
      {
        "bool": {
          "should": [
            {
              "term": {
                "investorDateOfBirth": {
                  "value": "1987-11-12"
                }
              }
            },
            {
              "term": {
                "financialAdviserId": {
                  "value": "123"
                }
              }
            }
          ]
        }
      }
    ]
  }
}

(I removed the boosts and other details to make the idea clearer.)

Why the first query from the question does not work

Now let me explain why the initial query does not work.

You used must and should in the same instance of bool query. The documented behavior in this case is the following:

should

If the bool query is in a query context and has a must or filter clause then a document will match the bool query even if none of the should queries match.

(This is also why the suggestion to use filter from Federico does not solve the issue.)

So actually the query you applied had the following logical meaning:

    query_restricting_set_of_docs AND (user_query OR True)

And you were looking for this instead:

    query_restricting_set_of_docs AND user_query
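To make the difference between these two expressions concrete, here is a minimal sketch that evaluates a tiny subset of the query DSL (only term, and bool with must/filter/should) against in-memory documents. It only models the query-context rule quoted above and is not how Elasticsearch works internally. The `matches` helper and the third sample document (adviser 456) are made up for illustration, and user_query is simplified to the date-of-birth term alone.

```python
def matches(query, doc):
    """Return True if doc satisfies query (term / bool subset, query context only)."""
    if "term" in query:
        # A term query names exactly one field: {"field": {"value": ...}}
        field, spec = next(iter(query["term"].items()))
        return str(doc.get(field)) == str(spec["value"])

    bool_query = query["bool"]
    required = bool_query.get("must", []) + bool_query.get("filter", [])
    optional = bool_query.get("should", [])

    if not all(matches(q, doc) for q in required):
        return False
    # With a must or filter clause present, should clauses become optional.
    # Without one, at least one should clause has to match.
    if optional and not required:
        return any(matches(q, doc) for q in optional)
    return True


restriction = {"term": {"financialAdviserId": {"value": "123"}}}
user_query = {"term": {"investorDateOfBirth": {"value": "1987-11-12"}}}

docs = [
    {"investorDateOfBirth": "1987-11-12", "financialAdviserId": "123", "investorId": "111"},
    {"investorDateOfBirth": "1900-11-12", "financialAdviserId": "123", "investorId": "222"},
    # Hypothetical record that belongs to a different adviser.
    {"investorDateOfBirth": "1987-11-12", "financialAdviserId": "456", "investorId": "444"},
]

# must and should side by side: restriction AND (user_query OR True)
flat = {"bool": {"must": [restriction], "should": [user_query]}}
# should wrapped in its own bool: restriction AND user_query
nested = {"bool": {"must": [restriction, {"bool": {"should": [user_query]}}]}}

flat_ids = [d["investorId"] for d in docs if matches(flat, d)]
nested_ids = [d["investorId"] for d in docs if matches(nested, d)]
print(flat_ids)    # ['111', '222']
print(nested_ids)  # ['111']
```

In the flat form the date-of-birth clause only influences scoring, while in the nested form it decides whether a document matches at all.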

In your case user_query appears to look like this:

    query_restricting_set_of_docs OR some_other_query

which brings us the final expression:

    query_restricting_set_of_docs AND (
        query_restricting_set_of_docs OR some_other_query
    )

which translates into an ES bool query like this:

{
  "bool": {
    "must": [
      {
        ...query_restricting_set_of_docs
      },
      {
        "bool": {
          "should": [
            {
              ...query_restricting_set_of_docs
            },
            {
              ...some_other_query
            }
          ]
        }
      }
    ]
  }
}
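If your API builds queries programmatically, this template amounts to wrapping whatever the client asked for in an outer bool. A minimal sketch using plain Python dicts that mirror the JSON (the `restrict_to_subset` name is hypothetical, and the resulting dict would still have to be sent through whatever client you use):

```python
import json


def restrict_to_subset(restriction, user_query):
    """Combine both queries as: restriction AND user_query."""
    return {"bool": {"must": [restriction, user_query]}}


restriction = {"term": {"financialAdviserId": {"value": "123"}}}

# The query built from the client's parameters, kept as its own bool so
# that its should clauses are not made optional by the outer must.
user_query = {
    "bool": {
        "should": [
            {"term": {"investorDateOfBirth": {"value": "1987-11-12"}}},
            {"term": {"financialAdviserId": {"value": "123"}}},
        ]
    }
}

final_query = restrict_to_subset(restriction, user_query)
print(json.dumps(final_query, indent=2))
```

The point of the helper is that the restricting condition never shares a bool with the client's should clauses, whatever the client sends.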

Note about query and filter context

The main differences between filter and query context are:

  • query context computes relevance score and results are not cached
  • filter context does not compute score but the results are cached

The caching part will make searches faster, though without the relevance score you won't be able to show more relevant documents first. In your case you would probably like to put the query_restricting_set_of_docs into the filter context.

To do so you might use the following query:

{
  "bool": {
    "must": [
      {
        "bool": {
          "filter": [
            {
              "term": {
                "financialAdviserId": {
                  "value": "123"
                }
              }
            }
          ]
        }
      },
      {
        "bool": {
          "should": [
            {
              "term": {
                "investorDateOfBirth": {
                  "value": "1987-11-12"
                }
              }
            },
            {
              "term": {
                "financialAdviserId": {
                  "value": "123"
                }
              }
            }
          ]
        }
      }
    ]
  }
}

Here we wrap query_restricting_set_of_docs in another bool with filter, which puts the restricting part into filter context.
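As far as I know, the extra wrapping bool is not strictly required: a filter clause placed next to must in the same bool should also run in filter context. A sketch of that variant with plain Python dicts (the `restrict_with_filter` name is hypothetical, and this is an alternative shape, not the query shown above):

```python
import json


def restrict_with_filter(restriction, user_query):
    """Put the restriction in filter context (no scoring) and keep the
    user's query in query context so it still drives relevance."""
    return {"bool": {"filter": [restriction], "must": [user_query]}}


query = restrict_with_filter(
    {"term": {"financialAdviserId": {"value": "123"}}},
    {
        "bool": {
            "should": [
                {"term": {"investorDateOfBirth": {"value": "1987-11-12"}}}
            ]
        }
    },
)
print(json.dumps(query, indent=2))
```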

If you have control over your index and there are a few different subsets of the index that you want to restrict searches to, you may use Filtered Aliases, which add the specified filter to every query executed against the alias.
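For illustration, an alias with a filter is created through the _aliases endpoint. The sketch below only builds the request body; the index name `investors` and the alias name `adviser_123` are made up, and sending the body (e.g. as POST /_aliases) is left to your client.

```python
import json


def filtered_alias_action(index, alias, filter_query):
    """Build the body of an _aliases request that adds a filtered alias."""
    return {
        "actions": [
            {"add": {"index": index, "alias": alias, "filter": filter_query}}
        ]
    }


body = filtered_alias_action(
    "investors",     # hypothetical index name
    "adviser_123",   # hypothetical alias name
    {"term": {"financialAdviserId": "123"}},
)
print(json.dumps(body, indent=2))
```

Searches sent to the alias would then only see documents that pass the filter, so the client's query could be used unchanged.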

Hope that helps!

Upvotes: 3
