S Mev
S Mev

Reputation: 327

Sum nested array in node.js mongodb

I have a schema in mongodb which looks like this.

first_level:[{
    first_item  : String,
    second_level:[{
        second_item: String,
        third_level:[{
            third_item :String,
            forth_level :[{//4th level
                    price               : Number, // 5th level
                    sales_date          : Date, 
                    quantity_sold       : Number
                }]
        }]
    }]
}]

1). I want to add quantity_sold based on matching criteria in first_item, second_item, third_item and sales_date

2). I also want to find average of all the quantity_sold in a specific date.

3). I also want to find average of all the quantity_sold in a specific date with there corresponding prices.

I have been very confused as to how I can go about this, I am coming from sql background so this is quite confusing

Upvotes: 0

Views: 574

Answers (1)

Neil Lunn
Neil Lunn

Reputation: 151072

Let's start with a basic disclaimer in that the main body of what answers the problem has already been answered here at Find in Double Nested Array MongoDB. And "for the record" the Double also applies to Triple or Quadrupal or ANY level of nesting as basically the same principle ALWAYS.

The other main point of any answer is also Don't NEST Arrays, since as is explained in that answer as well ( and I've repeated this many times ), whatever reason you "think" you have for "nesting" actually does not give you the advantages that you percieve it will. In fact "nesting" is really just making life far more difficult.

Nested Problems

The main misconception of any translation of a data structure from a "relational" model is just about always interpreted as "add a nested array level" for each and every associated model. What you are presenting here is no exception to this misconception as it very much appears to be "normalized" so that each sub-array contains the related items to it's parent.

MongoDB is a "document" based database, so it pretty much allows you to do this or in fact any data structure content you basically want. That does not however mean the data in such a form is easy to work with or indeed practical for the actual purpose.

Let's fill out the schema with some actual data to demonstrate:

{
  "_id": 1,
  "first_level": [
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "A",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-10-31"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-01"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-02"),
                  "quantity": 1
                },
              ]
            },
            { 
              "third_item": "B",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-10-31"),
                  "quantity": 1
                },
              ]
            }
          ]
        },
        {
          "second_item": "A",
          "third_level": [
            { 
              "third_item": "B",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                },
              ]
            }
          ]
        }
      ]
    },
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "B",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                },
              ]
            }
          ]
        }
      ]
    }
  ]
},
{
  "_id": 2,
  "first_level": [
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "A",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 2,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-10-31"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                }
              ]
            }
          ]
        }
      ]
    }
  ]
},
{
  "_id": 3,
  "first_level": [
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "B",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

That's a "little" different from the structure in the question but for demonstration purposes it has the things we need to look at. Mainly there is an array in the document which has items with a sub-array, which in turn has items in a sub-array and so on. The "normalizing" here is of course by the identifiers on each "level" as an "item type" or whatever you actually have.

The core problem is you just want "some" of the data from within these nested arrays, and MongoDB really just wants to return the "document", which means you need to do some manipulation in order just to get to those matching "sub-items".

Even on the issue of "correctly" selecting the document which matches all of these "sub-criteria" requires extensive use of $elemMatch in order to get the correct combination of conditions on each level of array elements. You cannot use straight up "Dot Notation" due to the need for those multiple conditions. Without the $elemMatch statements you don't get the exact "combination" and just get documents where the condition was true on any array element.

As for actually "filtering out the array contents" then that is actually the part of additional difference:

db.collection.aggregate([
  { "$match": {
    "first_level": {
      "$elemMatch": {
        "first_item": "A",
        "second_level": {
          "$elemMatch": {
            "second_item": "A",
            "third_level": {
              "$elemMatch": {
                "third_item": "A",
                "forth_level": {
                  "$elemMatch": {
                    "sales_date": {
                      "$gte": new Date("2018-11-01"),
                      "$lt": new Date("2018-12-01")
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }},
  { "$addFields": {
    "first_level": {
      "$filter": {
        "input": {
          "$map": {
            "input": "$first_level",
            "in": {
              "first_item": "$$this.first_item",
              "second_level": {
                "$filter": {
                  "input": {
                    "$map": {
                      "input": "$$this.second_level",
                      "in": {
                        "second_item": "$$this.second_item",
                        "third_level": {
                          "$filter": {
                            "input": {
                              "$map": {
                                "input": "$$this.third_level",
                                 "in": {
                                   "third_item": "$$this.third_item",
                                   "forth_level": {
                                     "$filter": {
                                       "input": "$$this.forth_level",
                                       "cond": {
                                         "$and": [
                                           { "$gte": [ "$$this.sales_date", new Date("2018-11-01") ] },
                                           { "$lt": [ "$$this.sales_date", new Date("2018-12-01") ] }
                                         ]
                                       }
                                     }
                                   }
                                 } 
                              }
                            },
                            "cond": {
                              "$and": [
                                { "$eq": [ "$$this.third_item", "A" ] },
                                { "$gt": [ { "$size": "$$this.forth_level" }, 0 ] }
                              ]
                            }
                          }
                        }
                      }
                    }
                  },
                  "cond": {
                    "$and": [
                      { "$eq": [ "$$this.second_item", "A" ] },
                      { "$gt": [ { "$size": "$$this.third_level" }, 0 ] }
                    ]
                  }
                }
              }
            }
          }
        },
        "cond": {
          "$and": [
            { "$eq": [ "$$this.first_item", "A" ] },
            { "$gt": [ { "$size": "$$this.second_level" }, 0 ] }
          ]
        } 
      }
    }
  }},
  { "$unwind": "$first_level" },
  { "$unwind": "$first_level.second_level" },
  { "$unwind": "$first_level.second_level.third_level" },
  { "$unwind": "$first_level.second_level.third_level.forth_level" },
  { "$group": {
    "_id": {
      "date": "$first_level.second_level.third_level.forth_level.sales_date",
      "price": "$first_level.second_level.third_level.forth_level.price",
    },
    "quantity_sold": {
      "$avg": "$first_level.second_level.third_level.forth_level.quantity"
    } 
  }},
  { "$group": {
    "_id": "$_id.date",
    "prices": {
      "$push": {
        "price": "$_id.price",
        "quanity_sold": "$quantity_sold"
      }
    },
    "quanity_sold": { "$avg": "$quantity_sold" }
  }}
])

This is best described as "messy" and "involved". Not only is our initial query for document selection with the $elemMatch more than a mouthful, but then we have the subsequent $filter and $map processing for every array level. As mentioned earlier, this is the pattern no matter how many levels there actually are.

You could alternately do a $unwind and $match combination instead of filtering the arrays in place, but this does cause additional overhead to $unwind before the unwanted content is removed, so in modern releases of MongoDB it's generally better practice to $filter from the array first.

The end place here is you want to $group by elements that are actually inside the array, so you end up needing to $unwind each level of the arrays anyway before this.

The actual "grouping" is then generally straightforward using the sales_date and price properties for the first accumulation, and then adding a subsequent stage to $push the different price values you want to accumulate an average for within each date as a second accumulation.

NOTE: The actual handling of dates may well vary in practical use depending on the granularity at which you store them. In this sample the dates are all just already rounded to the start of each "day". If you actually need to accumulate real "datetime" values, then you probably really want a construct like this or similar:

{ "$group": {
  "_id": {
    "date": {
      "$dateFromParts": {
        "year": { "$year": "$first_level.second_level.third_level.forth_level.sales_date" },
        "month": { "$month": "$first_level.second_level.third_level.forth_level.sales_date" },
        "day": { "$dayOfMonth": "$first_level.second_level.third_level.forth_level.sales_date" }
      }
    }.
    "price": "$first_level.second_level.third_level.forth_level.price"
  }
  ...
}}

Using $dateFromParts and other date aggregation operators to extract the "day" information and present the date back in that form for accumulation.

Starting to Denormalize

What should be clear from the "mess" above is that working with nested arrays is not exactly easy. Such structures were generally not even possible to atomically update in releases prior to MongoDB 3.6, and even if you never even updated them or lived with replacing basically the whole array, they still are not simple to query. This is what you are being shown.

Where you must have array content within a parent document it is generally advised to "flatten" and "denormalize" such structures. This may appear contrary to relational thinking, but it's actually the best way to handle such data for performance reasons:

{
  "_id": 1,
  "data": [
    {
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-10-31"),
      "quantity": 1
    },

    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-01"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-02"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "B",
      "price": 1,
      "sales_date": new Date("2018-10-31"),
      "quantity": 1
    },
    {
     "first_item": "A",
     "second_item": "A",
     "third_item": "B",
     "price": 1,
     "sales_date": new Date("2018-11-03"),
     "quantity": 1
    },
    {
      "first_item": "A",
      "second_item": "B",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
     },
  ]
},
{
  "_id": 2,
  "data": [
    {
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 2,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-10-31"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
    }
  ]
},
{
  "_id": 3,
  "data": [
    {
      "first_item": "A",
      "second_item": "B",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
     }
  ]
}

That's all the same data as originally shown, yet instead of nesting we actually just put everything into a singular flattened array within each parent document. Sure this means duplication of various data points, but the difference in query complexity and performance should be self evident:

db.collection.aggregate([
  { "$match": {
    "data": {
      "$elemMatch": {
        "first_item": "A",
        "second_item": "A",
        "third_item": "A",
        "sales_date": {
          "$gte": new Date("2018-11-01"),
          "$lt": new Date("2018-12-01")
        }
      }
    }
  }},
  { "$addFields": {
    "data": {
      "$filter": {
        "input": "$data",
         "cond": {
           "$and": [
             { "$eq": [ "$$this.first_item", "A" ] },
             { "$eq": [ "$$this.second_item", "A" ] },
             { "$eq": [ "$$this.third_item", "A" ] },
             { "$gte": [ "$$this.sales_date", new Date("2018-11-01") ] },
             { "$lt": [ "$$this.sales_date", new Date("2018-12-01") ] }
           ]
         }
      }
    }
  }},
  { "$unwind": "$data" },
  { "$group": {
    "_id": {
      "date": "$data.sales_date",
      "price": "$data.price",
    },
    "quantity_sold": { "$avg": "$data.quantity" }
  }},
  { "$group": {
    "_id": "$_id.date",
    "prices": {
      "$push": {
        "price": "$_id.price",
        "quantity_sold": "$quantity_sold"
      }
    },
    "quantity_sold": { "$avg": "$quantity_sold" }
  }}
])

Now instead of nesting those $elemMatch calls and similarly for the $filter expressions, everything is much clearer and easy to read and really quite simple in processing. There is another advantage in that you actually can even index the keys of the elements in the array as used in the query. That was a constraint of the nested model where MongoDB simply won't allow such "Multikey indexing" on keys of arrays within arrays. With a single array this is allowed and can be used to improve performance.

Everything after the "array content filtering" then remains exactly the same, with the exception it's just path names like "data.sales_date" as opposed to the long winded "first_level.second_level.third_level.forth_level.sales_date" from the previous structure.

When NOT to Embed

Finally the other big misconception is is that ALL Relations need to be translated as embedding within arrays. This really was never the intent of MongoDB and you were only ever meant to keep "related" data within the same document in an array in the case where it meant doing a single retrieval of data as opposed to "joins".

The classic "Order/Details" model here typically applies where in the modern world you want to display "header" for an "Order" with details such as customer address, order total and so on within the same "screen" as the details of different line items on the "Order".

Way back in the inception of the RDBMS, the typical 80 character by 25 line screen simply had such "header" information on one screen, then the detail lines for everything purchased was on a different screen. So naturally there was some level of common sense to store those in separate tables. As the world moved to more detail on such "screens" you typically want to see the whole thing, or at least the "header" and the first so many lines of such an "order".

Hence why this sort of arrangement makes sense to put into an array, since MongoDB returns a "document" containing the related data all at once. No need for separate requests for separate rendered screens and no need for "joins" on such data since it's already "pre-joined" as it were.

Consider if you need it - AKA "Fully" Denormalize

So in cases where you pretty much know you're not actually interested in dealing with most of the data in such arrays most of the time, it generally makes more sense to simply put it all in one collection on it's own with merely another property in order to identify the "parent" should such "joining" be occasionally required:

{
  "_id": 1,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-10-31"),
  "quantity": 1
},
{ 
  "_id": 2,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-01"),
  "quantity": 1
},
{ 
  "_id": 3,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-02"),
  "quantity": 1
},
{ 
  "_id": 4,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "B",
  "price": 1,
  "sales_date": new Date("2018-10-31"),
  "quantity": 1
},
{
  "_id": 5,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "B",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{
  "_id": 6,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "B",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{
  "_id": 7,
  "parent_id": 2,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 2,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{ 
  "_id": 8,
  "parent_id": 2,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-10-31"),
  "quantity": 1
},
{ 
  "_id": 9,
  "parent_id": 2,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{
  "_id": 10,
  "parent_id": 3,
  "first_item": "A",
  "second_item": "B",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
}

Again it's the same data, but just this time in completely separate documents with a reference to the parent at best in the case where you might actually need it for another purpose. Note that the aggregations here all do not relate to the parent data at all and it's also clear where the additional performance and removed complexity come in by simply storing in a separate collection:

db.collection.aggregate([
  { "$match": {
    "first_item": "A",
    "second_item": "A",
    "third_item": "A",
    "sales_date": {
      "$gte": new Date("2018-11-01"),
      "$lt": new Date("2018-12-01")
    }
  }},
  { "$group": {
    "_id": {
      "date": "$sales_date",
      "price": "$price"
    },
    "quantity_sold": { "$avg": "$quantity" }
  }},
  { "$group": {
    "_id": "$_id.date",
    "prices": {
      "$push": {
        "price": "$_id.price",
        "quantity_sold": "$quantity_sold"
      }
    },
    "quantity_sold": { "$avg": "$quantity_sold" }
  }}
])

Since everything is already a document there is no need to "filter down arrays" or have any of the other complexity. All you are doing is selecting the matching documents and aggregating the results, with exactly the same two final steps that have been present all along.

For the purpose of just getting to the final results, this performs far better than either above alternative. The query in question is really only concerned with the "detail" data, therefore the best course of action is to separate the detail from the parent completely as it's always going to provide the best performance benefit.

And the overall point here is where the actual access pattern of the rest of the application NEVER needs to return the entire array content, then it probably should not have been embedded anyway. Seemingly most "write" operations should similarly never need to touch the related parent anyway, and that's another deciding factor where this works or does not.

Conclusion

The general message is again that as a general rule you should never nest arrays. At most you should keep a "singular" array with partially denormalized data within the related parent document, and where the remaining access patterns really do not use the parent and child in tandem much at all, then the data really should be separated.

The "big" change is that all the reasons why you think normalizing data is actually good, turns out to be the enemy of such embedded document systems. Avoiding "joins" is always good, but creating complex nested structure to have the appearance of "joined" data never really works out for your benefit either.

The cost of dealing with what you "think" is normalization usually ends up outwaying the additional storage and maintenance of duplicated and denormalized data within your eventual storage.

Also note that all forms above return the same result set. It's pretty derivative in that the sample data for brevity only includes singular items, or at most where there are multiple price points the "average" is still 1 since that's what all the values are anyway. But the content to explain this is already exceedingly long so it's really just "by example":

{
        "_id" : ISODate("2018-11-01T00:00:00Z"),
        "prices" : [
                {
                        "price" : 1,
                        "quantity_sold" : 1
                }
        ],
        "quantity_sold" : 1
}
{
        "_id" : ISODate("2018-11-02T00:00:00Z"),
        "prices" : [
                {
                        "price" : 1,
                        "quantity_sold" : 1
                }
        ],
        "quantity_sold" : 1
}
{
        "_id" : ISODate("2018-11-03T00:00:00Z"),
        "prices" : [
                {
                        "price" : 1,
                        "quantity_sold" : 1
                },
                {
                        "price" : 2,
                        "quantity_sold" : 1
                }
        ],
        "quantity_sold" : 1
}

Upvotes: 1

Related Questions