Michal Artazov

Reputation: 4648

Mongo regex not matching words in a string by prefix

I have a collection of books in my MongoDB database:

{
    "title": "Some cool title",
    "authors": [ "Author1", "Author2", ... ],
    ...
}

I want to create a reasonably smart search engine for those books. When the user types something into the search input, this happens:

  1. Convert input string into array of keywords
  2. Search all documents where at least one keyword matches title or name of any author

Then I do some more magic with it, but the thing I need help with is this: when I say that a keyword matches a title/author, I mean that it matches some word in the title/author or is a prefix of such a word. For example, do would match any string that contains do, doing, or double, but not ado or badoo.
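To make the intended "word or word prefix" semantics concrete, here is a minimal sketch that checks them locally with PHP's preg_match instead of MongoDB. The helper name buildPrefixPattern is hypothetical and only used for illustration; the sample strings come from the example above.

```php
<?php
// Hypothetical helper: build a case-insensitive "word prefix" pattern from keywords.
function buildPrefixPattern(array $keywords)
{
    // Escape regex metacharacters (and the "/" delimiter) in every keyword.
    $quoted = array_map(function ($keyword) {
        return preg_quote($keyword, '/');
    }, $keywords);

    // \b requires a word boundary right before the keyword.
    return '/\b(?:' . implode('|', $quoted) . ')/i';
}

$pattern = buildPrefixPattern(['do']);

foreach (['I do', 'doing well', 'a double', 'ado', 'badoo'] as $sample) {
    printf("%-12s %s\n", $sample, preg_match($pattern, $sample) ? 'match' : 'no match');
}
// The first three samples print "match"; "ado" and "badoo" print "no match".
```

For plain ASCII text this behaves as described, so the approach itself is sound as long as the word boundary is detected correctly.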

I googled it and this should be the right way to do it:

public function searchBooksByKeywords($keywords) {

    array_walk($keywords, function(&$keyword) {
        $keyword = preg_quote($keyword, "/");
    });

    $filter = array(
        '$or'      => [
            [ "title"    => new \MongoRegex("/\\b(" . implode('|', $keywords) . ")/i") ],
            [ "authors"   => new \MongoRegex("/\\b(" . implode('|', $keywords) . ")/i") ],
        ]
    );

    $books = $this->database->Books->find($filter);
    return \iterator_to_array($books);
}

It doesn't work. I still get results like steal for tea, i.e. it matches even substrings inside words, not just prefixes. I'm pretty lost here...

BTW, I use PHP.

EDIT: I found the probable cause of the problem. When a match occurs inside a word, the searched keyword comes immediately after a non-ASCII character (though maybe not after all of them). For example, I searched for sto and got results like Město & město; for ste it found Kroatien Dalmatinische Küste and Ostseeküste,Darss,Rostock, etc.

Upvotes: 1

Views: 1789

Answers (3)

Michal Artazov

Reputation: 4648

I finally found a solution. I simply added the u flag to the regex.

new \MongoRegex("/\\b(" . implode('|', $keywords) . ")/iu")

PHP Documentation says

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

It can be found here.
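The effect of the u flag on \b can be reproduced locally with preg_match. This is only a sketch of what PHP's own PCRE does; that the regex engine on the MongoDB server treats the flag the same way is an assumption here, supported only by the result reported above.

```php
<?php
// "Město" spelled out as UTF-8 bytes so the example does not depend on file encoding.
$title = "M\xC4\x9Bsto";

// Without /u the subject is treated as single bytes. The two bytes of "ě"
// are not word characters, so \b sees a boundary right before "sto".
$withoutU = preg_match('/\b(sto)/i', $title);

// With /u the subject is decoded as UTF-8 and "ě" counts as a letter,
// so there is no word boundary inside "Město".
$withU = preg_match('/\b(sto)/iu', $title);

var_dump($withoutU); // int(1)
var_dump($withU);    // int(0)
```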

Upvotes: 2

user557597

Reputation:

After looking at your edit, it's clear you need to enhance the word boundary to restrict it to ASCII characters only. There are many ways to do this.

If the first character of a search string/keyword could be in the range \x80 - \xff, then a whole different approach is necessary. Hopefully that's not the case.

 new \MongoRegex("/(?:^|(?<=[\\x00-\\x7f]))(?=[\\x00-\\x7f])\\b(" . implode('|', $keywords) . ")/i")

 # --------------------------------------------
 # Using hex 
 (?:                           # Group start
      ^                             # Beginning of string
   |  (?<= [\x00-\x7f] )            # or, ASCII character behind us
 )                             # Group end
 (?= [\x00-\x7f] )             # ASCII character in front of us
 \b                            # word boundary

 # --------------------------------------------
 # Using Posix 
 (?:                           # Group start
      ^                             # Beginning of string
   |  (?<= [[:ascii:]] )            # or, ASCII character behind us
 )                             # Group end
 (?= [[:ascii:]] )             # ASCII character in front of us
 \b                            # word boundary
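
As a quick local check of the hex variant, the following sketch runs it through PHP's preg_match with a single hard-coded keyword (ste). The sample strings are illustrative assumptions, and the behaviour inside MongoDB is not verified here.

```php
<?php
// The pattern from this answer, with one hard-coded keyword ("ste").
$pattern = '/(?:^|(?<=[\x00-\x7f]))(?=[\x00-\x7f])\b(ste)/i';

$kueste = "K\xC3\xBCste";   // "Küste" as UTF-8 bytes
$steuer = 'die Steuer';

// The byte right before "ste" belongs to "ü" and is not ASCII, so the lookbehind fails.
$inKueste = preg_match($pattern, $kueste);

// "Steuer" follows an ASCII space, so both lookarounds and \b succeed.
$inSteuer = preg_match($pattern, $steuer);

var_dump($inKueste); // int(0)
var_dump($inSteuer); // int(1)
```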

Upvotes: 0

Bryan Elliott

Reputation: 4095

Try this:

new \MongoRegex("/\\b(" . implode('|', $keywords) . ").*\\b/i")

EDIT:

As the OP mentions in his edit, the above regex fails when the searched text contains non-ASCII characters: for example, the keyword sto matches results like Město & město, and ste matches Küste, etc.

Therefore, in this case, I modified the regex as follows:

new \MongoRegex("/(?:^|\\s)(" . implode('|', $keywords) . ")/i")

regex example: http://regex101.com/r/nR9lH6
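
A small local sketch of how the whitespace-anchored pattern behaves, again using preg_match with a hard-coded keyword (sto) rather than MongoDB. The sample strings are assumptions chosen for illustration.

```php
<?php
// Keyword must sit at the start of the string or right after whitespace.
$pattern = '/(?:^|\s)(sto)/i';

$results = [
    'Město'     => preg_match($pattern, "M\xC4\x9Bsto"),      // keyword inside a word
    'sto let'   => preg_match($pattern, 'sto let'),           // keyword at string start
    'staré sto' => preg_match($pattern, "star\xC3\xA9 sto"),  // keyword after a space
    '(sto)'     => preg_match($pattern, '(sto)'),             // keyword after punctuation
];

var_dump($results);
```

The first and last entries come out as 0 and the middle two as 1. The last case shows the trade-off of this approach: a word that directly follows punctuation, such as an opening parenthesis, is not matched.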

Upvotes: 1
