Prakash Thapa

Reputation: 1285

How to concatenate all values and find a specific substring in MongoDB?

I have a JSON document like this:

{
  "A": [
    {
      "C": "abc",
      "D": "de"
    },
    {
      "C": "fg",
      "D": "hi"
    }
  ]
}

I want to check whether "A" contains the string ef or not: first concatenate all the values into abcdefghi, then search for ef.

In XML, XPATH it would be something like: //A[contains(., 'ef')]

Is there any similar query in MongoDB?

Upvotes: 0

Views: 1440

Answers (1)

Neil Lunn

Reputation: 151122

All options are pretty horrible for this type of search, but there are a few approaches you can take. Please note though that the end case here is likely the best solution, but I present the options in order to illustrate the problem.

If your keys in the array "A" are consistently defined and "A" always contains an array, you would be searching like this:

db.collection.aggregate([
    // Filter the documents containing your parts
    { "$match": {
        "$and": [
            { "$or": [
                { "A.C": /e/ },
                { "A.D": /e/ } 
            ]},
            {"$or": [
                { "A.C": /f/ },
                { "A.D": /f/ }
            ]}
        ]
    }},

    // Keep the original form and a copy of the array
    { "$project": { 
        "_id": { 
            "_id": "$_id", 
            "A": "$A" 
        },
        "A": 1 
    }},

    // Unwind the array
    { "$unwind": "$A" },

    // Join the two fields and push to a single array
    { "$group": {
         "_id": "$_id",
         "joined": { "$push": {
             "$concat": [ "$A.C", "$A.D" ]
         }}
    }},

    // Copy the array
    { "$project": {
        "C": "$joined",
        "D": "$joined"
    }},

    // Unwind both arrays
    { "$unwind": "$C" },
    { "$unwind": "$D" },

    // Join the copies and test if they are the same
    { "$project": {
        "joined": { "$concat": [ "$C", "$D" ] },
        "same": { "$eq": [ "$C", "$D" ] },
    }},

    // Discard the "same" elements and search for the required string
    { "$match": {
        "same": false,
        "joined": { "$regex": "ef" }
    }},

    // Project the original form of the matching documents
    { "$project": {
        "_id": "$_id._id",
        "A": "$_id.A"
    }}
])

So apart from the horrible $regex matching, there are a few hoops to jump through in order to get the fields "joined" so the string can be searched for in sequence. Also note the reverse-order joins that are possible here, which could produce a false positive. There is currently no simple way to avoid that reverse join or otherwise filter it out, so there is that to consider.
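To see that reverse-join false positive concretely, here is a plain-Node.js sketch (not mongo shell) of the pairwise joining the pipeline performs; the function name `pairwiseMatches` and the `needle` parameter are my own, not part of any MongoDB API:

```javascript
// Sketch of the pipeline's pairwise joining: concatenate C and D within
// each array element (as $concat does), then join every pair of distinct
// joined values in BOTH orders, as the double $unwind produces.
function pairwiseMatches(elements, needle) {
  const joined = elements.map(e => e.C + e.D);   // ["abcde", "fghi"]
  const hits = [];
  for (const left of joined) {
    for (const right of joined) {
      if (left === right) continue;              // the "same": false filter
      if ((left + right).includes(needle)) hits.push(left + right);
    }
  }
  return hits;
}

const A = [ { C: "abc", D: "de" }, { C: "fg", D: "hi" } ];
console.log(pairwiseMatches(A, "ef"));    // [ "abcdefghi" ] — the wanted match
console.log(pairwiseMatches(A, "hiab"));  // [ "fghiabcde" ] — a reverse-join false positive
```

The second call matches only because the elements were also joined in reverse order, which is exactly the unwanted case the pipeline cannot easily filter out.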

Another approach is to basically run everything through arbitrary JavaScript. The mapReduce method can be your vehicle for this. Here you can be a bit looser with the types of data that can be contained in "A" and try to tie in some more conditional matching to attempt to reduce the set of documents you are working on:

db.collection.mapReduce(
    function () {

      var joined = "";

      if ( Object.prototype.toString.call( this.A ) === '[object Array]' ) {
        this.A.forEach(function(doc) {
          for ( var k in doc ) {
            joined += doc[k];
          }
        });
      } else {
        joined = this.A;  // presuming this is just a string
      }

      var id = this._id;
      delete this["_id"];

      if ( joined.match(/ef/) )
        emit( id, this  );

    },
    function(){},    // will not reduce
    { 
        "query": {
            "$or": [
                { "A": /ef/ },
                { "$and": [
                    { "$or": [
                        { "A.C": /e/ },
                        { "A.D": /e/ } 
                    ]},
                    {"$or": [
                        { "A.C": /f/ },
                        { "A.D": /f/ }
                    ]}
                ] }
            ]
        },
        "out": { "inline": 1 }
    }
);

So you can use that with whatever arbitrary logic to search the contained objects. This one just differentiates between "arrays" and otherwise presumes a string, allowing the additional part of the query to search for a matching "string" element first, which acts as a "short circuit" evaluation.
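The joining logic inside the mapper can be lifted out and tested on its own in plain Node.js; the function name `joinA` here is mine, but the type check and key loop mirror the map function above:

```javascript
// Plain-JavaScript version of the mapper's joining logic:
// if A is an array, concatenate every value of every sub-document;
// otherwise presume A is already a plain string.
function joinA(doc) {
  let joined = "";
  if (Object.prototype.toString.call(doc.A) === "[object Array]") {
    doc.A.forEach(function (sub) {
      for (const k in sub) joined += sub[k];
    });
  } else {
    joined = doc.A;  // presuming this is just a string
  }
  return joined;
}

const doc = { A: [ { C: "abc", D: "de" }, { C: "fg", D: "hi" } ] };
console.log(joinA(doc));             // "abcdefghi"
console.log(/ef/.test(joinA(doc)));  // true
```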

But really at the end of the day, the best approach is to simply have the data present in your document, and you would have to maintain this yourself as you update the document contents:

{
  "A": [
    {
      "C": "abc",
      "D": "de"
    },
    {
      "C": "fg",
      "D": "hi"
    }
  ],
  "search": "abcdefghi"
}

So that still involves a horrible usage of $regex-type queries, but at least it avoids (or rather shifts to the time the document is written) the overhead of "joining" the elements in order to search for your desired string.
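A minimal sketch of "maintaining this yourself": recompute the pre-joined string whenever the array changes, and write both fields together. The helper `buildSearch` is hypothetical, not a MongoDB feature, and it assumes the "C"/"D" keys of the sample:

```javascript
// Hypothetical helper: rebuild the pre-stored "search" field from "A".
// Missing keys are treated as empty strings.
function buildSearch(A) {
  return A.map(e => (e.C || "") + (e.D || "")).join("");
}

const A = [ { C: "abc", D: "de" }, { C: "fg", D: "hi" } ];
console.log(buildSearch(A)); // "abcdefghi"

// In the shell you would then write both fields in the same update, e.g.:
// db.collection.update(
//     { "_id": id },
//     { "$set": { "A": A, "search": buildSearch(A) } }
// )
// and the search itself becomes a plain (if still $regex-based) query:
// db.collection.find({ "search": /ef/ })
```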

Where this eventually leads is that a "full blown" text search solution (and at this time that means an external one, as opposed to the text search facilities in MongoDB) is probably going to be your best performance option.

Either use the "pre-stored" approach of creating your "joined" field, or, where supported (Solr is one solution that can do this), define a "computed field" in the text index that is created when indexing document content.

At any rate, those are the approaches and the general point of the problem. This is not XPath searching, nor is there some "XPath like" view of an entire collection in this sense, so you are best suited to structuring your data towards the methods that are going to give you the best performance.

With all of that said, your sample here is a fairly contrived example, and if you had an actual use case for something "like" this, then that actual case may make a very interesting question indeed. Actual cases generally have different solutions than the contrived ones. But now you have something to consider.

Upvotes: 1
