Query to Collectively Match Intersection of Documents against a Field

Question

In MongoDB, how can I return all documents that:

Share a value for a particular field.
Have a field equal to a value.
Between all documents that satisfy 1 & 2, they also collectively match all the fields of an inputted array of 1 to n values. I.e., every value must be accounted for.

E.g., given the below collection, I want to match all the documents that:

Have the same channelId
dayOfWeek is equal to Monday

Between all documents that satisfy 1 & 2, they should also collectively match product Ids [1, 2]. I.e., if I have a channelId that matches dayOfWeek, it also must have a single document that has productId of 1 and another document with productId of 2.

{
    channelId: "ID-A",
    dayOfWeek: "MONDAY",
    productId: "1"
}, 
{
    channelId: "ID-A",
    dayOfWeek: "MONDAY",
    productId: "2"
},
{
    channelId: "ID-B",
    dayOfWeek: "MONDAY",
    productId: "1"
},
{
    channelId: "ID-B",
    dayOfWeek: "MONDAY",
    productId: "3"
},
{
    channelId: "ID-C",
    dayOfWeek: "MONDAY",
    productId: "1"
},
{
    channelId: "ID-C",
    dayOfWeek: "TUESDAY",
    productId: "2"
}

The desired return in this case would be:

    {
        channelId: "ID-A",
        dayOfWeek: "MONDAY",
        productId: "1"
    }, 
    {
        channelId: "ID-A",
        dayOfWeek: "MONDAY",
        productId: "2"
    }

ID-B is removed since it doesn’t have a productId equal to 2.
ID-C is removed since although both productIds are matched, between the matched items they don’t have a dayOfWeek equal to Monday.

Neil Lunn · Accepted Answer

Basically you need to $group everything together by the common key after filtering on the common condition for dayOfWeek and then looking to see if the grouping does indeed "pair" and actually contains the two results you want to "intersect":

db.collection.aggregate([
  { "$match": { "dayOfWeek": "MONDAY" } },
  { "$group": {
    "_id": "$channelId",
    "docs": { "$push": "$$ROOT" },
    "count": { "$sum": 1 }
  }},
  { "$match": {
    "count": { "$gt": 1 },
    "docs": {
      "$all": [
        { "$elemMatch": { "productId": "1" } },
        { "$elemMatch": { "productId": "2" } }
      ]
    }
  }}
])

Once "grouped" the $all with $elemMatch conditions inside it ensure that "both" conditions are met within the "grouped" documents now inside the array. Also note that if you actually mean "only two" then you can look for "count": 2 as opposed to simply "count": { "$gt": 1 } which in example here means the "grouping" at least paired with "something", if not exactly "two".

That basically returns a result with the matching documents per grouping as:

{
        "_id" : "ID-A",
        "docs" : [
                {
                        "_id" : ObjectId("5b22f455fe0315289f716483"),
                        "channelId" : "ID-A",
                        "dayOfWeek" : "MONDAY",
                        "productId" : "1"
                },
                {
                        "_id" : ObjectId("5b22f455fe0315289f716484"),
                        "channelId" : "ID-A",
                        "dayOfWeek" : "MONDAY",
                        "productId" : "2"
                }
        ],
        "count" : 2
}

If you need the "documents only" as result, then you can take that further after $unwind with $replaceRoot if you have a supporting MongoDB version past 3.4:

db.collection.aggregate([
  { "$match": { "dayOfWeek": "MONDAY" } },
  { "$group": {
    "_id": "$channelId",
    "docs": { "$push": "$$ROOT" },
    "count": { "$sum": 1 }
  }},
  { "$match": {
    "count": { "$gt": 1 },
    "docs": {
      "$all": [
        { "$elemMatch": { "productId": "1" } },
        { "$elemMatch": { "productId": "2" } }
      ]
    }
  }},
  { "$unwind": "$docs" },
  { "$replaceRoot": { "newRoot": "$docs" } }
])

Or with $project and explicitly naming all fields where you don't:

db.collection.aggregate([
  { "$match": { "dayOfWeek": "MONDAY" } },
  { "$group": {
    "_id": "$channelId",
    "docs": { 
      "$push": {
        "_id": "$_id",
        "channelId": "$channelId",
        "dayOfWeek": "$dayOfWeek",
        "productId": "$productId
      }
    },
    "count": { "$sum": 1 }
  }},
  { "$match": {
    "count": { "$gt": 1 },
    "docs": {
      "$all": [
        { "$elemMatch": { "productId": "1" } },
        { "$elemMatch": { "productId": "2" } }
      ]
    }
  }},
  { "$unwind": "$docs" },
  { "$project": {
    "_id": "$docs._id",
    "channelId": "$docs.channelId",
    "dayOfWeek": "$docs.dayOfWeek",
    "productId": "$docs.productId"
  }}
])

Actually in that last form the statement is basically compatible with every release of MongoDB since the aggregation framework was released with version 2.2.

Alternately you can actually "use" the $setIsSubset operator as long as you have a MongoDB 3.6 or greater version with $expr:

db.collection.aggregate([
  { "$match": { "dayOfWeek": "MONDAY" } },
  { "$group": {
    "_id": "$channelId",
    "docs": { "$push": "$$ROOT" },
    "count": { "$sum": 1 }
  }},
  { "$match": {
    "count": { "$gt": 0 },
    "$expr": {
      "$setIsSubset": [ [ "1", "2" ], "$docs.productId" ]
    }
  }}
])

And you can even vary that with $redact or using $project followed by another $match, but it's really not the point as though you likely thought "set operators" these are not actually best suited to the particular results you are looking for here.

Note though that any kind of "intersection" or "subset" basically relies on being able to compare the documents against each other. This essentially means putting the "grouped" things into an array for such a comparison. If the actual results size causes such a "grouping" to exceed the BSON limit, them you cannot actually use such and approach and really have no other option than to load the matching documents to the initial query filter through a cursor and examine.

For "completeness" therefore you can possibly consider in that case where you have $lookup available to use a "self referencing join" as opposed to using $push to accumulate the matching documents:

db.collection.aggregate([
  { "$match": { "dayOfWeek": "MONDAY" } },
  { "$group": {
    "_id": "$channelId",
    "count": { "$sum": 1 }
  }},
  { "$match": { "count": { "$gt": 1 } } },  // keep only "multiple" groups
  { "$lookup": { 
    "from": "collection",
    "localField": "_id",
    "foreignField": "channelId",
    "as": "docs"
  }},
  { "$unwind": "$docs" },
  // ** See note below about the $match **
  //{ "$match": { "docs.productId": { "$in": [ "1", "2" ] } } },
])

The advantage here is the "array" of "docs" never actually gets constructed as per the special handling of $lookup + $unwind Coalescence which essentially "rolls up" the unwinding action to be within the $lookup itself. In this way you get the same documents was would be accumulated via the $push operation, but already "separated" into their own documents in a way which does not break the 16MB BSON limit.

The limitation remains however that the "set" cannot actually be compared whilst in this form as you need the "array" in order to see if "grouped" items are within the "set". So it's actually that "grouping" action which is being avoided to avoid the limit breach. Nonetheless this is generally better than simply iterating the cursor of matching documents to "MONDAY" alone since you already have an indication of the "grouped" results by "channelId".

The only other comparison that can be done there is using an additional $match using $in. This again would be "rolled up" into the actual $lookup operation to efficiently only return those documents which also matched that condition. The result however essentially "negates" as it's really just the same results that can be achieved using $in with the initial query, and of course means "only" those documents containing either "1" or "2" and not any others with respect to those values being "part of the subset" instead.

Query to Collectively Match Intersection of Documents against a Field

Answers (1)

Related Questions