Reputation: 4648
I have a collection of books in my MongoDB database:
{
"title": "Some cool title",
"authors": [ "Author1", "Author2", ... ],
...
}
I want to create a reasonably smart search engine for those books. If the user types something into the search input, this happens:
Then I do some more magic with it, but the thing I need help with is this: when I say that a keyword matches a title or author, I mean that it matches a word in the title/author or a prefix of one. For example, "do" would match any string that contains "do", "doing", or "double", but not "ado" or "badoo".
I googled it and this should be the right way to do it:
public function searchBooksByKeywords($keywords) {
    array_walk($keywords, function(&$keyword) {
        $keyword = preg_quote($keyword, "/");
    });
    $filter = array(
        '$or' => [
            [ "title" => new \MongoRegex("/\\b(" . implode('|', $keywords) . ")/i") ],
            [ "authors" => new \MongoRegex("/\\b(" . implode('|', $keywords) . ")/i") ],
        ]
    );
    $books = $this->database->Books->find($filter);
    return \iterator_to_array($books);
}
It doesn't work. I still get results like "steal" for "tea", i.e. it matches even substrings inside words, not just prefixes. I'm pretty lost here...
BTW, I use PHP.
EDIT: I found the probable cause of the problem. When a match occurs inside a word, the matched text comes immediately after a non-ASCII character (though maybe not all of them). For example, I searched for "sto" and got results like "Město & město"; for "ste" it found "Kroatien Dalmatinische Küste" and "Ostseeküste,Darss,Rostock", etc.
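A minimal local sketch of that observation, assuming UTF-8 encoded strings and using PHP's own preg_match rather than MongoDB (the helper name matchesBytewise is hypothetical). Without the u flag the subject is treated as raw bytes, so the bytes of a multi-byte character such as ě or ü do not count as word characters and \b reports a boundary right after them:

```php
<?php
// Hypothetical helper: same pattern shape as the query above, but applied
// locally via preg_match and WITHOUT the u flag (byte-wise matching).
function matchesBytewise(string $keyword, string $text): bool
{
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')/i';
    return preg_match($pattern, $text) === 1;
}

// "\u{11B}" is ě and "\u{FC}" is ü; PHP stores both as two UTF-8 bytes.
var_dump(matchesBytewise('sto', "M\u{11B}sto")); // bool(true): false positive
var_dump(matchesBytewise('ste', "K\u{FC}ste"));  // bool(true): false positive
var_dump(matchesBytewise('do', 'badoo'));        // bool(false): pure ASCII behaves
```

The pure-ASCII case behaves as intended, which fits the pattern of only seeing bad matches right after accented letters.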
Upvotes: 1
Views: 1789
Reputation: 4648
I finally found a solution. I simply added the u flag to the regex:
new \MongoRegex("/\\b(" . implode('|', $keywords) . ")/iu")
The PHP documentation says:
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
It can be found here.
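A small sketch of the effect of the flag, assuming the keywords and the searched text are UTF-8. The helper buildPrefixPattern is a hypothetical name, and the check runs locally with preg_match only to illustrate the behaviour; the returned string has the same shape as the one passed to \MongoRegex above.

```php
<?php
// Hypothetical helper: builds a "/\b(kw1|kw2)/iu" pattern string.
function buildPrefixPattern(array $keywords): string
{
    $quoted = array_map(function ($keyword) {
        return preg_quote($keyword, '/');
    }, $keywords);

    // With the u flag, PHP treats pattern and subject as UTF-8, so accented
    // letters count as word characters and \b no longer fires after them.
    return '/\b(' . implode('|', $quoted) . ')/iu';
}

$pattern = buildPrefixPattern(['sto', 'ste']);

var_dump(preg_match($pattern, "M\u{11B}sto & m\u{11B}sto")); // int(0)
var_dump(preg_match($pattern, "Dalmatinische K\u{FC}ste"));  // int(0)
var_dump(preg_match($pattern, 'Old stone bridge'));          // int(1)
```

Note that with the u flag preg_match returns false rather than 0 when the subject is not valid UTF-8, so that case may need separate handling.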
Upvotes: 2
Reputation:
After looking at your edit, it's clear you need to enhance the word boundary to restrict it to ASCII characters only. There are many ways to do this.
If the first character in a search string/keyword could be between \x80 - \xff, then a whole different approach is necessary. Hopefully that's not the case.
new \MongoRegex("/(?:^|(?<=[\\x00-\\x7f]))(?=[\\x00-\\x7f])\\b(" . implode('|', $keywords) . ")/i")
# --------------------------------------------
# Using hex
(?: # Group start
^ # Beginning of string
| (?<= [\x00-\x7f] ) # or, ASCII character behind us
) # Group end
(?= [\x00-\x7f] ) # ASCII character in front of us
\b # word boundary
# --------------------------------------------
# Using Posix
(?: # Group start
^ # Beginning of string
| (?<= [[:ascii:]] ) # or, ASCII character behind us
) # Group end
(?= [[:ascii:]] ) # ASCII character in front of us
\b # word boundary
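The guarded pattern can be tried locally with preg_match. This is only a sketch, assuming UTF-8 text and ASCII-only keywords, with asciiGuardedPattern as a hypothetical helper name. Because the u flag is absent, the lookbehind inspects a single byte, and every byte of a multi-byte UTF-8 character lies in \x80-\xff:

```php
<?php
// Hypothetical helper: wraps the keywords in the ASCII-guarded boundary
// (hex variant) shown above.
function asciiGuardedPattern(array $keywords): string
{
    $quoted = array_map(function ($keyword) {
        return preg_quote($keyword, '/');
    }, $keywords);

    // Single quotes leave \x00 and \x7f for the regex engine to interpret.
    return '/(?:^|(?<=[\x00-\x7f]))(?=[\x00-\x7f])\b('
        . implode('|', $quoted) . ')/i';
}

$pattern = asciiGuardedPattern(['sto', 'ste']);

var_dump(preg_match($pattern, "M\u{11B}sto")); // int(0): byte before "s" is not ASCII
var_dump(preg_match($pattern, "K\u{FC}ste"));  // int(0)
var_dump(preg_match($pattern, 'stone'));       // int(1): start of string
var_dump(preg_match($pattern, 'Old stone'));   // int(1): ASCII space before "s"
```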
Upvotes: 0
Reputation: 4095
Try this:
new \MongoRegex("/\\b(" . implode('|', $keywords) . ").*\\b/i")
EDIT:
As the OP mentions in his edit, the above regex fails when the searched text contains non-ASCII characters: for example, the keyword "sto" matches results like "Město" & "město", and "ste" matches "Küste", etc.
Therefore, in this case, I modified the regex as follows:
new \MongoRegex("/(?:^|\\s)(" . implode('|', $keywords) . ")/i")
regex example: http://regex101.com/r/nR9lH6
Upvotes: 1