Reputation: 995
Assume that my stored Lucene documents have a `regex` stored field that represents a regular expression, e.g. `doc.add(new StringField("regex", "\\d{3}[A-G]\\d{2}[A-G]\\d{2}", Store.YES));` (note the doubled backslashes required in a Java string literal).
My search input is something like `123D56G89`.
Is there a way to do a reverse matching in my TermQuery and fetch all documents that match the given input?
Coming from an RDBMS background, I know MariaDB has a REGEXP operator for this.
Upvotes: 5
Views: 832
Reputation: 1906
If you want to leverage search index features to search many documents in sublinear time, then No: with the information given in your question, there is not a way. You must examine every document in the index and evaluate each document's stored expression against the input.
A regular expression is essentially a type of program. In general, without being able to reason about specific concepts encoded in the expression, evaluating it requires knowing the full expression, and the engine must actually run it. This means there's no way to generally summarize or categorize the field into a search index to speed up lookup. If you wish to check a string against N regular expressions, you must go through those N regular expressions one by one and check them. At that point, a search index is not providing any benefits to store, fetch, or manage them.
If you are totally fine with "slow" search, and you are dead-set on storing arbitrary expressions in this way, then technically Yes, you can implement a new type of query that treats a field as a regular expression and runs it against the input. I would not consider this a normal use of a search index, but it is technically as feasible as any other sort of evaluation.
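For illustration, here is a minimal sketch of that brute-force approach, assuming a Lucene 8/9-era stored-fields API and the `regex` field from your question (`reverseMatch` is a hypothetical helper, not anything Lucene provides):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public class ReverseRegexSearch {

    // Brute-force "reverse match": load every document's stored regex and
    // run it against the input. This is O(N) in index size by design.
    public static List<Document> reverseMatch(Directory dir, String input) throws IOException {
        List<Document> hits = new ArrayList<>();
        try (IndexReader reader = DirectoryReader.open(dir)) {
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                // Deleted documents are not filtered out here, for brevity.
                Document doc = reader.document(docId);
                String regex = doc.get("regex");
                // Pattern.matches requires the whole input to match the expression.
                if (regex != null && Pattern.matches(regex, input)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }
}
```

Note that nothing here uses the index as an index: every document is loaded and every expression is compiled and executed per search.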
However, maybe you are trying to solve the wrong problem. There could be a better way to represent the concepts that you are currently trying to store as a regular expression. If you can devise a more specific "language" or structure for your matching, then in principle you can create an analyzer that turns that data into field(s) that are indexable and optimizable.
Example: Maybe you just want to use regular expressions to match some ID code (like `1200ABC00012` or `1G021`) based on the number of digits, then the number of letters, in its prefix. In that case, rather than indexing a regular expression, a better idea would be to index those two numbers instead: the digit count and the letter count from the prefix. So if a search string is `DG56`, I might search for documents matching a query like `numberPrefixWidth:0 letterPrefixWidth:2`. Or for a search string `789FGH4`, my query would be `numberPrefixWidth:3 letterPrefixWidth:3`.
Because we have simplified the concepts actually represented in the document, there is no need to look at every document (and basically run a stored program) to find the one that matches. We can use Lucene to do the kind of search it's fast at.
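As a rough sketch of that scheme (the field names come from the example above; the prefix-splitting logic is my assumption about what the ID codes look like):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class PrefixWidthIndexing {

    // Index the two prefix widths instead of a regular expression.
    public static Document buildDoc(String idCode) {
        int digits = 0;
        while (digits < idCode.length() && Character.isDigit(idCode.charAt(digits))) {
            digits++;
        }
        int letters = 0;
        while (digits + letters < idCode.length()
                && Character.isLetter(idCode.charAt(digits + letters))) {
            letters++;
        }
        Document doc = new Document();
        // IntPoint fields are indexed for fast exact and range queries.
        doc.add(new IntPoint("numberPrefixWidth", digits));
        doc.add(new IntPoint("letterPrefixWidth", letters));
        doc.add(new StoredField("idCode", idCode));
        return doc;
    }

    // For a search string like "789FGH4", pass numberWidth=3, letterWidth=3.
    public static Query buildQuery(int numberWidth, int letterWidth) {
        return new BooleanQuery.Builder()
                .add(IntPoint.newExactQuery("numberPrefixWidth", numberWidth), Occur.MUST)
                .add(IntPoint.newExactQuery("letterPrefixWidth", letterWidth), Occur.MUST)
                .build();
    }
}
```

The point is that the query side now reduces to exact integer matches, which the index answers without evaluating anything per document.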
Note: this answer also applies to your RDBMS example. If you are trying to do something in MariaDB like `WHERE someSearch REGEXP theRegexpColumn`, the engine has to run over every single row and evaluate the expression. There is no potential for index-based optimization whatsoever in a design like that. The difference is that Lucene is more special-purpose and doesn't have a language as broad as SQL, so you can't easily run such a query without doing some of the work yourself.
Upvotes: 2