Dragomir Ivanov
Dragomir Ivanov

Reputation: 544

Show only matching fields for MongoDB text search

I am new to Mongo, and wanted to implement text search functionality for a Web front-end. I have added all text fields in a collection in the "text" index, so search finds a match in all the fields. Documents may be heavy.

The problem is that when I receive the whole matching documents and not just the matching fields. I want to get only the matching fields along with the document _id, so I can present just a hints in the Web type-ahead, and when the user selects a match, I can load the whole document by the _id.

There is a $project operator, but the problem is that I don't know which of the text fields the match will appear.

Upvotes: 4

Views: 4550

Answers (1)

Markus W Mahlberg
Markus W Mahlberg

Reputation: 20703

After thinking about this a long time, I think it is possible to implement what you want. However, it is not suitable for very large databases and I haven't worked out an incremental approach yet. It lacks stemming and stop words have to be defined manually.

The idea is to use mapReduce to create a collection of search words with references to the document of origin and the field where the search word originated from. Then, for the actual query for the autocompletion is done using a simple aggregation which utilizes an index and therefor should be rather fast.

So we will work with the following three documents

{
  "name" : "John F. Kennedy",
  "address" : "Kenson Street 1, 12345 Footown, TX, USA",
  "note" : "loves Kendo and Sushi"
}

and

{
  "name" : "Robert F. Kennedy",
  "address" : "High Street 1, 54321 Bartown, FL, USA",
  "note" : "loves Ethel and cigars"
}

and

{
  "name" : "Robert F. Sushi",
  "address" : "Sushi Street 1, 54321 Bartown, FL, USA",
  "note" : "loves Sushi and more Sushi"
}

in a collection called textsearch.

The map/reduce stage

What we basically do is that we will process each and every word in one of the three fields, remove stop words and numbers and save each and every word with the document's _id and the field of the occurrence in an intermediate table.

The annotated code:

db.textsearch.mapReduce(
  function() {

    // We need to save this in a local var as per scoping problems
    var document = this;

    // You need to expand this according to your needs
    var stopwords = ["the","this","and","or"];

    // This denotes the fields which should be processed
    var fields = ["name","address","note"];

    // For each field...
    fields.forEach(

      function(field){

        // ... we split the field into single words...
        var words = (document[field]).split(" ");

        words.forEach(

          function(word){
            // ...and remove unwanted characters.
            // Please note that this regex may well need to be enhanced
            var cleaned = word.replace(/[;,.]/g,"")

            // Next we check...
            if(
              // ...wether the current word is in the stopwords list,...
              (stopwords.indexOf(word)>-1) ||

              // ...is either a float or an integer... 
              !(isNaN(parseInt(cleaned))) ||
              !(isNaN(parseFloat(cleaned))) ||

              // or is only one character.
              cleaned.length < 2
            )
            {
              // In any of those cases, we do not want to have the current word in our list.
              return
            }
              // Otherwise, we want to have the current word processed.
              // Note that we have to use a multikey id and a static field in order
              // to overcome one of MongoDB's mapReduce limitations:
              // it can not have multiple values assigned to a key.
              emit({'word':cleaned,'doc':document._id,'field':field},1)

          }
        )
      }
    )
  },
  function(key,values) {

    // We sum up each occurence of each word
    // in each field in every document...
    return Array.sum(values);
  },
    // ..and write the result to a collection
  {out: "searchtst" }
)

Running this will result in the creation of the collection searchtst. If it already existed, all of it's contents will be replaced.

It will look something like this:

{ "_id" : { "word" : "Bartown", "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "Bartown", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "Ethel", "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "note" }, "value" : 1 }
{ "_id" : { "word" : "FL", "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "FL", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" }, "value" : 1 }
{ "_id" : { "word" : "Footown", "doc" : ObjectId("544b7e44fd9270c1492f5834"), "field" : "address" }, "value" : 1 }
[...]
{ "_id" : { "word" : "Sushi", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "name" }, "value" : 1 }
{ "_id" : { "word" : "Sushi", "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "note" }, "value" : 2 }
[...]

There are a few things to note here. First of all, a word can have multiple occurrences, for example with "FL". However, it may be in different documents, as it is the case here. A word can also have multiple occurrences in a single field of a single document, on the other hand. We will use this to our advantage later.

Second, we have all the fields, most notably the wordfield in a compound index for _id, which should make the coming queries pretty fast. However, this also means the the index will be quite large and – as for all indices – tends to eat up RAM.

The aggregation stage

So we have reduced the list of words. Now we query for a (sub)string. What we need to do is to find all words beginning with the string the user typed in so far, returning a list of words matching that string. In order to be able to do this and to get the results in a form suitable for us, we use an aggregation.

This aggregation should be pretty fast, since all necessary fields to query are part of a compound index.

Here is the annotated aggregation for the case when the user typed in the letter S:

db.searchtst.aggregate(
  // We match case insensitive ("i") as we want to prevent
  // typos to reduce our search results
  { $match:{"_id.word":/^S/i} },
  { $group:{
      // Here is where the magic happens:
      // we create a list of distinct words...
      _id:"$_id.word",
      occurrences:{
        // ...add each occurrence to an array...
        $push:{
          doc:"$_id.doc",
          field:"$_id.field"
        } 
      },
      // ...and add up all occurrences to a score
      // Note that this is optional and might be skipped
      // to speed up things, as we should have a covered query
      // when not accessing $value, though I am not too sure about that
      score:{$sum:"$value"}
    }
  },
  {
    // Optional. See above
    $sort:{_id:-1,score:1}
  }
)

The result of this query looks something like this and should be pretty self-explanatory:

{
  "_id" : "Sushi",
  "occurences" : [
    { "doc" : ObjectId("544b7e44fd9270c1492f5834"), "field" : "note" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "name" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "note" }
  ],
  "score" : 5
}
{
  "_id" : "Street",
  "occurences" : [
    { "doc" : ObjectId("544b7e44fd9270c1492f5834"), "field" : "address" },
    { "doc" : ObjectId("544b9811fd9270c1492f5835"), "field" : "address" },
    { "doc" : ObjectId("544bb320fd9270c1492f583c"), "field" : "address" }
  ],
  "score" : 3
}

The score of 5 for Sushi comes from the fact that the word Sushi occurs twice in the note field of one of the documents. This is intended behavior.

While this may be a poor man's solution, needs to be optimized for the myriads of thinkable use cases and would need a incremental mapReduce to be implemented in order to be halfway useful in production environments, it works as expected. hth.

Edit

Of course, one could drop the $match stage and add an $out stage in the aggregation phase in order to have the results preprocessed:

db.searchtst.aggregate(
  {
    $group:{
      _id:"$_id.word",
      occurences:{ $push:{doc:"$_id.doc",field:"$_id.field"}},
      score:{$sum:"$value"}
     }
   },{
     $out:"search"
   })

Now, we can query the resulting search collection in order to speed things up. Basically you trade real time results for speed.

Edit 2: In case the preprocessing approach is taken, the searchtst collection of the example should be deleted after the aggregation is finished in order to save both disk space and – more important – precious RAM.

Upvotes: 5

Related Questions