Get most recent Sub-Document from Array

Question

I have an array. I would like to select the object with the highest revision number from my history arrays (plural).

My document looks like this (often it will be more than just one object in uploaded_files):

{
    "_id" : ObjectId("5935a41f12f3fac949a5f925"),
    "project_id" : 13,
    "updated_at" : ISODate("2017-07-02T22:11:43.426Z"),
    "created_at" : ISODate("2017-06-05T18:34:07.150Z"),
    "owner" : ObjectId("591eea4439e1ce33b47e73c3"),
    "name" : "Demo project",
    "uploaded_files" : [ 
        {
            "history" : [ 
                {
                    "file" : ObjectId("59596f9fb6c89a031019bcae"),
                    "revision" : 0
                }
            ],
            "_id" : ObjectId("59596f9fb6c89a031019bcaf")
            "display_name" : "Example filename.txt"
        }
    ]
}

My code which selects the document:

function getProject(req, projectId) {
    let populateQuery = [
        {path: 'owner'},
        {path: 'uploaded_files.history.file'}
    ]
    return new Promise(function (resolve, reject) {
        Project.findOne({ project_id: projectId }).populate(populateQuery).then((project) => {
            if (!project)
                reject(new createError.NotFound(req.path))
            resolve(project)
        }).catch(function (err) {
            reject(err)
        })
    })
}

How can I select the document so that it only outputs the object with the highest revision number from the history arrays?

Neil Lunn · Accepted Answer

You could tackle this in a couple of different ways. They vary on approach and performance of course, and I think there are some larger considerations you need to make to your design. Most notably here is the "need" for "revisions" data in the usage pattern of your actual application.

Query via aggregate

As for the foremost point of getting the "last element from the inner array", then you really should be using an .aggregate() operation to do this:

function getProject(req,projectId) {

  return new Promise((resolve,reject) => {
    Project.aggregate([
      { "$match": { "project_id": projectId } },
      { "$addFields": {
        "uploaded_files": {
          "$map": {
            "input": "$uploaded_files",
            "as": "f",
            "in": {
              "latest": {
                "$arrayElemAt": [
                  "$$f.history",
                  -1
                ]
              },
              "_id": "$$f._id",
              "display_name": "$$f.display_name"
            }
          }
        }
      }},
      { "$lookup": {
        "from": "owner_collection",
        "localField": "owner",
        "foreignField": "_id",
        "as": "owner"
      }},
      { "$unwind": "$uploaded_files" },
      { "$lookup": {
         "from": "files_collection",
         "localField": "uploaded_files.latest.file",
         "foreignField": "_id",
         "as": "uploaded_files.latest.file"
      }},
      { "$group": {
        "_id": "$_id",
        "project_id": { "$first": "$project_id" },
        "updated_at": { "$first": "$updated_at" },
        "created_at": { "$first": "$created_at" },
        "owner" : { "$first": { "$arrayElemAt": [ "$owner", 0 ] } },
        "name":  { "$first": "$name" },
        "uploaded_files": {
          "$push": {
            "latest": { "$arrayElemAt": [ "$$uploaded_files", 0 ] },
            "_id": "$$uploaded_files._id",
            "display_name": "$$uploaded_files.display_name"
          }
        }
      }}
    ])
    .then(result => {
      if (result.length === 0)
        reject(new createError.NotFound(req.path));
      resolve(result[0])
    })
    .catch(reject)
  })
}

Since this is an aggregation statement where we can also do the "joins" on the "server" as opposed to making additional requests ( which is what .populate() actually does here ) by using $lookup, I'm taking some liberty with the actual collection names since your schema is not included in the question. That's okay, since you did not realize you could in fact do it this way.

Of course the "actual" collection names are required by the server, which has no concept of the "application side" defined schema. There are things you can do for convenience here, but more on that later.

You should also note that depending on where projectId actually comes from, then unlike regular mongoose methods such as .find() the $match will require actually "casting" to an ObjectId if the input value is in fact a "string". Mongoose cannot apply "schema types" in an aggregation pipeline, so you might need to do this yourself, especially if projectId came from a request parameter:

  { "$match": { "project_id": Schema.Types.ObjectId(projectId) } },

The basic part here is where we use $map to iterate through all of the "uploaded_files" entries, and then simply extract the "latest" from the "history" array with $arrayElemAt using the "last" index, which is -1.

That should be reasonable since it's most likely that the "most recent revision" is in fact the "last" array entry. We could adapt this to look for the "biggest", by applying $max as a condition to $filter. So that pipeline stage becomes:

     { "$addFields": {
        "uploaded_files": {
          "$map": {
            "input": "$uploaded_files",
            "as": "f",
            "in": {
              "latest": {
                "$arrayElemAt": [
                   { "$filter": {
                     "input": "$$f.history.revision",
                     "as": "h",
                     "cond": {
                       "$eq": [
                         "$$h",
                         { "$max": "$$f.history.revision" }
                       ]
                     }
                   }},
                   0
                 ]
              },
              "_id": "$$f._id",
              "display_name": "$$f.display_name"
            }
          }
        }
      }},

Which is more or less the same thing, except we do the comparison to the $max value, and return only "one" entry from the array making the index to return from the "filtered" array the "first" position, or 0 index.

As for other general Techniques on using $lookup in place of .populate(), see my entry on "Querying after populate in Mongoose" which talks a bit more about things that can be optimized when taking this approach.

Query via populate

Also of course we can do ( even though not as efficiently ) the same sort of operation using .populate() calls and manipulating the resulting arrays:

Project.findOne({ "project_id": projectId })
  .populate(populateQuery)
  .lean()
  .then(project => {
    if (project === null) 
      reject(new createError.NotFound(req.path));

      project.uploaded_files = project.uploaded_files.map( f => ({
        latest: f.history.slice(-1)[0],
        _id: f._id,
        display_name: f.display_name
      }));

     resolve(project);
  })
  .catch(reject)

Where of course you are actually returning "all" of the items from "history", but we simply apply a .map() to invoke the .slice() on those elements to again get the last array element for each.

A bit more overhead since all the history is returned, and the .populate() calls are additional requests, but it does get the same end results.

A point of design

The main problem I see here though is that you even have a "history" array within the content. This is not really a great idea since you need to do things like above in order to only return the relevant item you want.

So as a "point of design", I would not do this. But instead I would "separate" the history from the items in all cases. Keeping with "embedded" documents, I would keep the "history" in a separate array, and only keep the "latest" revision with the actual content:

{
    "_id" : ObjectId("5935a41f12f3fac949a5f925"),
    "project_id" : 13,
    "updated_at" : ISODate("2017-07-02T22:11:43.426Z"),
    "created_at" : ISODate("2017-06-05T18:34:07.150Z"),
    "owner" : ObjectId("591eea4439e1ce33b47e73c3"),
    "name" : "Demo project",
    "uploaded_files" : [ 
        {
            "latest" : { 
                {
                    "file" : ObjectId("59596f9fb6c89a031019bcae"),
                    "revision" : 1
                }
            },
            "_id" : ObjectId("59596f9fb6c89a031019bcaf"),
            "display_name" : "Example filename.txt"
        }
    ]
    "file_history": [
      { 
        "_id": ObjectId("59596f9fb6c89a031019bcaf"),
        "file": ObjectId("59596f9fb6c89a031019bcae"),
        "revision": 0
    },
    { 
        "_id": ObjectId("59596f9fb6c89a031019bcaf"),
        "file": ObjectId("59596f9fb6c89a031019bcae"),
        "revision": 1
    }

}

You can maintain this simply by setting $set the relevant entry and using $push on the "history" in the one operation:

.update(
  { "project_id": projectId, "uploaded_files._id": fileId }
  { 
    "$set": {
      "uploaded_files.$.latest": { 
        "file": revisionId,
        "revision": revisionNum
      }
    },
    "$push": {
      "file_history": {
        "_id": fileId,
        "file": revisionId,
        "revision": revisionNum
      }
    }
  }
)

With the array separated, then you can simply query and always get the lastest, and discard the "history" until such time as you actually want to make that request:

Project.findOne({ "project_id": projectId })
  .select('-file_history')      // The '-' here removes the field from results
  .populate(populateQuery)

As a general case though I would simply not bother with the "revision" number at all. Keeping much of the same structure you do not really need it when "appending" to an array since the "latest" is always the "last". This is also true of changing the structure, where again the "latest" will always be the last entry for the given uploaded file.

Trying to maintain such an "artificial" index is fraught with problems, and mostly ruins any change of "atomic" operations as shown in the .update() example here, since you need to know a "counter" value in order to supply the latest revision number, and therefore need to "read" that from somewhere.

Get most recent Sub-Document from Array

Answers (1)

Query via aggregate

Query via populate

A point of design

Related Questions