Reputation: 53839

Queries equality in Solr / Lucene

The problem:

Trying to recognize that 2 different queries are actually the same.

For example:

field1:[1 TO 3] OR field1:5

is actually the same query as:

field1:5 OR field1:1 OR field1:3 OR field1:2

The idea:

Is there any way to normalize a query to some kind of canonical form so that after being normalized, a simple string comparison will do the trick?

For example, with the above example, both queries could become:

field1:1 OR field1:2 OR field1:3 OR field1:5

And then they can simply be compared to determine whether they are equal.

Or maybe there actually exists some kind of service that is able to determine if two queries are equal. I could not find any.

Thanks for helping.

Upvotes: 0

Answers (1)

femtoRgon

Reputation: 33351

The main problem is that those really aren't identical.

field1:[1 TO 3] is a range query, and it may represent a lexicographic range on the field, in which case it would match field1:2abcde, or may represent a numeric range on a floating point field, in which case it would match field1:1.234. The other query, field1:1 field1:2 field1:3, can only match the three specified values, so neither of those two examples would be matched.
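To make the difference concrete, here is a minimal sketch in plain Java (no Lucene involved) of the two range interpretations. The class and method names are hypothetical and exist only for illustration; they simply model an inclusive range test on string values versus on floating-point values.

```java
// Hypothetical helper, not part of Lucene: models the two ways
// a range like [1 TO 3] could be interpreted for a field.
class RangeSemanticsDemo {
    // Inclusive lexicographic range, as a string-typed field would see it.
    static boolean inLexicographicRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // Inclusive numeric range, as a floating-point field would see it.
    static boolean inNumericRange(double value, double lower, double upper) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        System.out.println(inLexicographicRange("2abcde", "1", "3")); // true
        System.out.println(inNumericRange(1.234, 1, 3));              // true
        // The two interpretations can also disagree on the same value:
        System.out.println(inLexicographicRange("10", "1", "3"));     // true
        System.out.println(inNumericRange(10, 1, 3));                 // false
    }
}
```

Neither "2abcde" nor 1.234 is one of the three enumerated values, which is why the range query and the enumerated query cannot be treated as the same without knowing the field type and contents.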

Also, since fields may be multi-valued, more than one of field1:1 field1:2 field1:3 may match in the same document, which would make the two queries score differently.

To consider a simpler case though, how about two queries we can be reasonably certain are identical, like:

field2:this field1:that
field1:that field2:this

Those are certainly identical, at least to the StandardQueryParser!

Once you have run the queries through the query parser, you'll have a Query. Transforming the final query back to a string doesn't tend to work well, since query parser syntax isn't capable of expressing every type of query object (Query.toString() is best used for debugging, really).

So you'll need to compare Query objects.

The output of Query.rewrite() would be the most readily comparable, I believe. This will provide you a set of primitive queries to dig into. This will provide the needed TermQueries for the range query, so it gets past the issues related to the initial query not knowing the field contents.

Neither Query nor IndexReader implements any form of direct comparison between queries. As far as I know, you would need to provide the comparator. This will involve comparing an arbitrarily complex nested set of primitive queries (primitive queries include: BooleanQuery, ConstantScoreQuery, CustomScoreQuery, DisjunctionMaxQuery, FilteredQuery, MatchAllDocsQuery, MultiPhraseQuery, MultiTermQuery, PhraseQuery, SpanQuery, TermQuery, ValueSourceQuery).
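As a rough illustration of what such a comparator has to do, here is a sketch over a toy query tree. ToyQuery, ToyTerm, ToyOr and ToyQueryComparator are hypothetical stand-ins invented for this example, not Lucene classes. The sketch only handles term clauses and OR groups, and it assumes field names and term text never contain the delimiter characters used in the key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for parsed query objects; these are NOT Lucene classes.
abstract class ToyQuery {
    // Two trees with equal keys are treated as structurally equivalent.
    abstract String canonicalKey();
}

final class ToyTerm extends ToyQuery {
    final String field;
    final String text;

    ToyTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    @Override
    String canonicalKey() {
        return "T(" + field + ":" + text + ")";
    }
}

final class ToyOr extends ToyQuery {
    final List<ToyQuery> clauses;

    ToyOr(ToyQuery... clauses) {
        this.clauses = Arrays.asList(clauses);
    }

    @Override
    String canonicalKey() {
        // Sort the child keys so that clause order does not matter.
        // Duplicates are kept, since repeated clauses can affect scoring.
        List<String> keys = new ArrayList<>();
        for (ToyQuery clause : clauses) {
            keys.add(clause.canonicalKey());
        }
        Collections.sort(keys);
        return "OR" + keys;
    }
}

final class ToyQueryComparator {
    static boolean equivalent(ToyQuery a, ToyQuery b) {
        return a.canonicalKey().equals(b.canonicalKey());
    }
}
```

With this, `new ToyOr(new ToyTerm("field2", "this"), new ToyTerm("field1", "that"))` and the same clauses in the opposite order compare as equivalent. A real comparator would need a case for each primitive query type, and would have to decide whether nested groups should be flattened and whether boosts matter.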

Really, the question is not whether the queries themselves are inherently identical; we've established they aren't. The more meaningful question, I think, is whether they are identical with regard to the data in the index. With that in mind, a much simpler implementation would be to search with each query, and compare the doc numbers (and possibly scores?) in each result set (TopDocs).
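A minimal sketch of that comparison step, assuming the doc numbers have already been pulled out of each result set (in Lucene they would come from TopDocs.scoreDocs[i].doc). The ResultSetComparison class is hypothetical. It also assumes both searches ran against the same index reader and that neither result list was truncated, since otherwise the doc numbers are not comparable.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: compares two lists of doc numbers.
class ResultSetComparison {
    // True when both queries matched exactly the same documents,
    // regardless of the order in which the hits were returned.
    static boolean sameDocs(int[] docsA, int[] docsB) {
        return toSet(docsA).equals(toSet(docsB));
    }

    private static Set<Integer> toSet(int[] docs) {
        Set<Integer> set = new HashSet<>();
        for (int doc : docs) {
            set.add(doc);
        }
        return set;
    }
}
```

This ignores scores on purpose. Two queries can match the same documents and still rank them differently, so comparing scores as well would be a stricter notion of "same".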

Upvotes: 1
