kayen
kayen

Reputation: 4868

Regular Expression to MATCH ALL words in a query, in any order

I'm trying to build a search feature for a project which narrows down items based on a user search input and if it matches the keywords listed against items. For this, I'm saving the item keywords in a data attribute and matching the query with these keywords using a RegExp pattern.

I'm currently using this expression, which I know is not correct and need your help on that:

new RegExp('\\b(' + query + ')', 'gi'))) where query is | separated values of the query entered by the user (e.g. \\b(meat|pasta|dinner)). This returns me a match even if there is only 1 match, say for example - meat

Just to throw some context, here's a small example:

If a user types: meat pasta dinner it should list all items which have ALL the 3 keywords listed against them i.e. meat pasta and dinner. These are independent of the order they're typed in.

Can you help me with an expression which will match ALL words in a query, in any order?

Upvotes: 44

Views: 53265

Answers (4)

Siderite Zackwehdex
Siderite Zackwehdex

Reputation: 6570

stema's answer is technically correct, but it doesn't take performance into account at all. Look aheads are extremely slow (in the context of regular expressions, which are lightning fast). Even with the current logic, the regular expression is not optimal.

So here are some measurements, calculated on larger strings which contain all three words, running the search 1000 times and using four different approaches:

stema's regular expression

/^(?=.*\bmeat\b)(?=.*\bpasta\b)(?=.*\bdinner\b).+/

result: 605ms

optimized regular expression

/^(?=.*?\bmeat\b)(?=.*?\bpasta\b)(?=.*?\bdinner\b)/

uses lazy matching and doesn't need the end all selector

result: 291ms

permutation regular expression

/(\bmeat\b.*?(\bpasta\b.*?\bdinner\b|\bdinner\b.*?\bpasta\b)|\bpasta\b.*?(\bmeat\b.*?\bdinner\b|\bdinner\b.*?\bmeat\b)|\bdinner\b.*?(\bpasta\b.*?\bmeat\b|\bmeat\b.*?\bpasta\b))/

result: 56ms

this is fast because the first pattern is matching, if the last pattern matched, it would be even slower than the look ahead one (300 ms)

array of regular expressions

var regs=[/\bmeat\b/,/\bpasta\b/,/\bdinner\b/];
var result = regs.every(reg=>reg.test(text));

result: 26ms

Note that if the strings are crafted to not match, then the results are:

  • 521ms
  • 220ms
  • 161ms - much slower because it has to go through all the branches
  • 14ms

As you can see, in all cases just using a loop is an order of magnitude faster, not to mention easier to read.

The original question was asking for a regular expression, so my answer to that is the permutation regular expression, but I would not use it, as its size would grow exponentially with the number of search words.

Also, in most cases this performance issue is academic, but it is necessary to be highlighted.

Upvotes: 10

prettyvoid
prettyvoid

Reputation: 3686

Based on the accepted answer I wrote a simple Java method that builds the regex from an array of keywords

public static String regexIfAllKeywordsExists(String[] keywords) {
    StringBuilder sb = new StringBuilder("^");

    for (String keyword : keywords) {
        sb.append("(?=.*\\b");
        sb.append(keyword);
        sb.append("\\b)");
    }

    sb.append(".+");

    return sb.toString();
}

Upvotes: 2

stema
stema

Reputation: 92986

You can achieve this will lookahead assertions

^(?=.*\bmeat\b)(?=.*\bpasta\b)(?=.*\bdinner\b).+

See it here on Regexr

(?=.*\bmeat\b) is a positive lookahead assertion, that ensures that \bmeat\b is somewhere in the string. Same for the other keywords and the .+ is then actually matching the whole string, but only if the assertions are true.

But it will match also on "dinner meat Foobar pasta"

Upvotes: 69

ColBeseder
ColBeseder

Reputation: 3669

your regex looks pretty good:

\b(meat|pasta|dinner)\b

Check that the length of matches equals the number of keywords (in this case, three):

string.match(re).length === numberOfKeywords

where re is the regex with a g flag, string is the data and numberOfKeywords is the number of keywords

This assumes that there are no repeated keywords.

Upvotes: 5

Related Questions