Inner Join on two Fields

Question

I have the following schemas

var User = mongoose.Schema({
    email:{type: String, trim: true, index: true, unique: true, sparse: true},
    password: String,
    name:{type: String, trim: true, index: true, unique: true, sparse: true},
    gender: String,
});

var Song = Schema({
    track: { type: Schema.Types.ObjectId, ref: 'track' },//Track can be deleted
    author: { type: Schema.Types.ObjectId, ref: 'user' },
    url: String,
    title: String,
    photo: String,
        publishDate: Date,
    views: [{ type: Schema.Types.ObjectId, ref: 'user' }],
    likes: [{ type: Schema.Types.ObjectId, ref: 'user' }],
    collaborators: [{ type: Schema.Types.ObjectId, ref: 'user' }],
});

I want to select all users (without the password value) , but I want each user will have all the songs where he is the author or one of the collaborators and the was published in the last 2 weeks.

What is the best strategy perform this action (binding between the user.id and song .collaborators) ? Can it be done in one select?

Neil Lunn · Accepted Answer

It's very possible in one request, and the basic tool for this with MongoDB is $lookup.

I would think this actually makes more sense to query from the Song collection instead, since your criteria is that they must be listed in one of two properties on that collection.

Optimal INNER Join - Reversed

Presuming the actual "model" names are what is listed above:

var today = new Date.now(),
    oneDay = 1000 * 60 * 60 * 24,
    twoWeeksAgo = new Date(today - ( oneDay * 14 ));

var userIds;   // Should be assigned as an 'Array`, even if only one

Song.aggregate([
  { "$match": { 
    "$or": [
      { "author": { "$in": userIds } },
      { "collaborators": { "$in": userIds } }
    ],
    "publishedDate": { "$gt": twoWeeksAgo }
  }},
  { "$addFields": { 
    "users": { 
      "$setIntersection": [ 
        userIds,
        { "$setUnion": [ ["$author"], "$collaborators" ] }
      ]
    }
  }},
  { "$lookup": {
    "from": User.collection.name,
    "localField": "users",
    "foreignField": "_id",
    "as": "users"
  }},
  { "$unwind": "$users" },
  { "$group": {
    "_id": "$users._id",
    "email": { "$first": "$users.email" },
    "name": { "$first": "$users.name" },
    "gender": { "$first": "$users.gender" },
    "songs": {
      "$push": {
        "_id": "$_id",
        "track": "$track",
        "author": "$author",
        "url": "$url",
        "title": "$title",
        "photo": "$photo",
        "publishedDate": "$publishedDate",
        "views": "$views",
        "likes": "$likes",
        "collaborators": "$collaborators"
      }
    }
  }}
])

That to me is the most logical course as long as it's an "INNER JOIN" you want from the results, meaning that "all users MUST have a mention on at least one song" in the two properties involved.

The $setUnion takes the "unique list" ( ObjectId is unique anyway ) of combining those two. So if an "author" is also a "collaborator" then they are only listed once for that song.

The $setIntersection "filters" the list from that combined list to only those that were specified in the query condition. This removes any other "collaborator" entries that would not have been in the selection.

The $lookup does the "join" on that combined data to get the users, and the $unwind is done because you want the User to be the main detail. So we basically reverse the "array of users" into "array of songs" in the result.

Also, since the main criteria is from Song, then it makes sense to query from that collection as the direction.

Optional LEFT Join

Doing this the other way around is where the "LEFT JOIN" is wanted, being "ALL Users" regardless if there are any associated songs or not:

User.aggregate([
  { "$lookup": {
    "from": Song.collection.name,
    "localField": "_id",
    "foreignField": "author",
    "as": "authors"
  }},
  { "$lookup": {
    "from": Song.collection.name,
    "localField": "_id",
    "foreignField": "collaborators",
    "as": "collaborators"
  }},
  { "$project": {
    "email": 1,
    "name": 1,
    "gender": 1,
    "songs": { "$setUnion": [ "$authors", "$collaborators" ] }
  }}
])

So the listing of the statement "looks" shorter, but it is forcing "two" $lookup stages in order to obtain results for possible "authors" and "collaborators" rather than one. So the actual "join" operations can be costly in execution time.

The rest is pretty straightforward in applying the same $setUnion but this time the the "result arrays" rather than the original source of the data.

If you wanted similar "query" conditions to above on the "filter" for the "songs" and not the actual User documents returned, then for LEFT Join you actually $filter the array content "post" $lookup:

User.aggregate([
  { "$lookup": {
    "from": Song.collection.name,
    "localField": "_id",
    "foreignField": "author",
    "as": "authors"
  }},
  { "$lookup": {
    "from": Song.collection.name,
    "localField": "_id",
    "foreignField": "collaborators",
    "as": "collaborators"
  }},
  { "$project": {
    "email": 1,
    "name": 1,
    "gender": 1,
    "songs": { 
      "$filter": {
        "input": { "$setUnion": [ "$authors", "$collaborators" ] },
        "as": "s",
        "cond": { 
          "$and": [
            { "$setIsSubset": [
              userIds
              { "$setUnion": [ ["$$s.author"], "$$s.collaborators" ] }
            ]},
            { "$gte": [ "$$s.publishedDate", oneWeekAgo ] }
          ]
        }
      }
    }
  }}
])

Which would mean that by LEFT JOIN Conditions, ALL User documents are returned but the only ones which will contain any "songs" will be those that met the "filter" conditions as being part of the supplied userIds. And even those users which were contained in the list will only show those "songs" within the required range for publishedDate.

The main addition within the $filter is the $setIsSubset operator, which is a short way of comparing the supplied list in userIds to the "combined" list from the two fields present in the document. Noting here the the "current user" already had to be "related" due to the earlier conditions of each $lookup.

MongoDB 3.6 Preview

A new "sub-pipeline" syntax available for $lookup from the MongoDB 3.6 release means that rather than "two" $lookup stages as shown for the LEFT Join variant, you can in fact structure this as a "sub-pipeline", which also optimally filters content before returning results:

User.aggregate([
  { "$lookup": {
    "from": Song.collection.name,
    "let": {
      "user": "$_id"
    },
    "pipeline": [
      { "$match": {
        "$or": [
          { "author": { "$in": userIds } },
          { "collaborators": { "$in": userIds } }
        ],
        "publishedDate": { "$gt": twoWeeksAgo },
        "$expr": {
          "$or": [
            { "$eq": [ "$$user", "$author" ] },
            { "$setIsSubset": [ ["$$user"], "$collaborators" ]
          ]
        }
      }}
    ],
    "as": "songs"
  }}
])

And that is all there is to it in that case, since $expr allows usage of the $$user variable declared in "let" to be compared with each entry in the song collection to select only those that are matching in addition to the other query criteria. The result being only those matching songs per user or an empty array. Thus making the whole "sub-pipeline" simply a $match expression, which is pretty much the same as additional logic as opposed to fixed local and foreign keys.

So you could even add a stage to the pipeline following $lookup to filter out any "empty" array results, making the overall result an INNER Join.

So personally I would go for the first approach when you can and only use the second approach where you need to.

NOTE: There are a couple of options here that don't really apply as well. The first being a special case of $lookup + $unwind + $match coalescence in which whilst the basic case applies to the initial INNER Join example it cannot be applied with the LEFT Join Case.

This is because in order for a LEFT Join to be obtained, the usage of $unwind must be implemented with preserveNullAndEmptyArrays: true, and this breaks the rule of application in that the unwinding and matching cannot be "rolled up" within the $lookup and applied to the foreign collection "before" returning results.

Hence why it is not applied in the sample and we use $filter on the returned array instead, since there is no optimal action that can be applied to the foreign collection "before" the results are returned, and nothing stopping all results for songs matching on simply the foreign key from returning. INNER Joins are of course different.

The other case is .populate() with mongoose. The most important distinction being that .populate() is not a single request, but just a programming "shorthand" for actually issuing multiple queries. So at any rate, there would actually be multiple queries issued and always requiring ALL results in order to apply any filtering.

Which leads to the limitation on where the filtering is actually applied, and generally means that you cannot really implement "paging" concepts when you utilize "client side joins" that require conditions to be applied on the foreign collection.

There are some more details on this on Querying after populate in Mongoose, and an actual demonstration of how the basic functionality can be wired in as a custom method in mongoose schema's anyway, but actually using the $lookup pipeline processing underneath.

Inner Join on two Fields

Answers (1)

Optimal INNER Join - Reversed

Optional LEFT Join

MongoDB 3.6 Preview

Related Questions