Henry Florence

Reputation: 2866

MongoDB checking for multiple regex matches inside a list for free text search

I am setting up a MongoDB database to allow (simple) keyword searching using multikeys, as recommended here. A record looks similar to:

{ title: { title: "A river runs through", _keywords: ["a","river","runs","through"] }, ... }

I am using Node.js server side, so the queries are written in JavaScript. The following query will match (this was run in the mongo terminal):

> db.torrents_sorted.find({'title._keywords' : {"$all" : ["river","the"]} }).count()
210

However, these do not:

> db.torrents_sorted.find({'title._keywords' : {"$all" : ["/river/i","/the/i"]} }).count()
0

> db.torrents_sorted.find({'title._keywords' : {"$all" : [{ "$regex" : "river", "$options" : "i" },{ "$regex" : "the", "$options" : "i" }]} }).count()
0

Using a single regex (without using $and or $all) does match:

> db.torrents_sorted.find({'title._keywords' : { "$regex" : "river", "$options" : "i" } }).count()
1461

Interestingly, using python and pymongo to compile the regular expressions does work:

>>> db.torrents_sorted.find({'title._keywords': { '$all': [re.compile('river'), re.compile('the')]}}).count();
236

I am not necessarily looking for a solution that uses regexes; however, keywords must also match on shorter strings, so that "riv" matches "river", which seems ideal for regexes (or LIKE in SQL).
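To make the "riv" matches "river" requirement concrete, here is a minimal sketch in plain JavaScript. The helper names escapeRegExp and partialMatcher are hypothetical, and the sketch assumes that search terms typed by a user should be matched literally, so regex metacharacters are escaped first.

```javascript
// Escape regex metacharacters so a user-typed term is matched literally.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: a case-insensitive pattern that matches the term
// anywhere inside a keyword, similar to LIKE '%term%' in SQL.
function partialMatcher(term) {
  return new RegExp(escapeRegExp(term), "i");
}

console.log(partialMatcher("riv").test("river"));   // true
console.log(partialMatcher("riv").test("through")); // false
```

Escaping matters because a term such as "a.b" would otherwise also match "axb".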

My next idea is to try passing in a JavaScript function that performs the regex matching on the list, or perhaps passing in a separate function for each regex (this does seem to scream hack at me :)), although I'm guessing this would be slower, and performance is very important.

Upvotes: 4

Views: 4769

Answers (2)

Henry Florence

Reputation: 2866

OK, I have an answer, and it is interesting in a different way. The bug I was hitting with regexes exists in MongoDB 1.8 and has since been fixed; it is shown here.

Sadly, the hosting company currently looking after the db cannot offer version 2.0, which is the version that added the $and operator. Thanks for the debugging help though, Samarth.

So instead I have written a javascript function to perform the regex matching:

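function () {
  // Every pattern must match at least one keyword of the current document (this).
  // test() matches anywhere in the string, so no surrounding ".*" is needed.
  var rs = [new RegExp("river"), new RegExp("runs")];

  for (var j = 0; j < rs.length; j++) {
    var val = false;
    // Stop scanning the keywords as soon as one matches this pattern.
    for (var i = 0; !val && i < this.title._keywords.length; i++)
      val = rs[j].test(this.title._keywords[i]);

    // A pattern with no matching keyword rejects the document immediately.
    if (!val) return false;
  }
  return true;
}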

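This runs in roughly O(n^2) time per document in the worst case (every regex tested against every keyword, which is not very cool), but it fails in linear time if the first regex does not match any of the keywords (since I am looking for a conjunction: every regex has to match).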
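The same early-exit logic can be sketched outside the database as a standalone function, which makes it easy to try in Node.js. The name matchesAllPatterns is hypothetical, and whether Array.prototype.every and Array.prototype.some are available inside the server's JavaScript engine on 1.8 is an assumption I have not verified.

```javascript
// Hypothetical standalone version of the check:
// every pattern must match at least one keyword.
function matchesAllPatterns(keywords, patterns) {
  // every() stops at the first pattern that matches no keyword;
  // some() stops at the first keyword that matches the pattern.
  return patterns.every(function (pattern) {
    return keywords.some(function (keyword) {
      return pattern.test(keyword);
    });
  });
}

var keywords = ["a", "river", "runs", "through"];
console.log(matchesAllPatterns(keywords, [/riv/, /run/])); // true
console.log(matchesAllPatterns(keywords, [/riv/, /the/])); // false
```

The worst case is still one test per (pattern, keyword) pair, so this tidies the code without changing the complexity.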

Any input on optimising this would be greatly appreciated, although if this is the best solution I can find for 1.8, I may have to find somewhere else to store my db in the near future, ;).

Upvotes: 0

Samarth Bhargava

Reputation: 4228

You might want to use the $and operator.
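A minimal sketch of what that could look like, assuming MongoDB 2.0 or later (where $and is available). The helper buildKeywordQuery is hypothetical; it only builds the query document, with one $regex clause per term, and treats each term as a regex pattern without escaping it.

```javascript
// Hypothetical helper: one { $regex, $options } clause per search term,
// combined with $and so that every term has to match some keyword.
function buildKeywordQuery(field, terms) {
  var clauses = terms.map(function (term) {
    var clause = {};
    clause[field] = { $regex: term, $options: "i" };
    return clause;
  });
  return { $and: clauses };
}

var query = buildKeywordQuery("title._keywords", ["river", "the"]);
console.log(JSON.stringify(query));
// In the mongo shell (2.0 or later): db.torrents_sorted.find(query).count()
```

Because 'title._keywords' is an array, each clause is satisfied when any one keyword matches its regex, so the combined query asks for all terms to be matched somewhere in the list.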

Upvotes: 2
