Abe Miessler
Abe Miessler

Reputation: 85046

Using the aggregation framework to compare array element overlap

I have a collections with documents structured like below:

{
   carrier: "abc",
   flightNumber: 123,
   dates: [
      ISODate("2015-01-01T00:00:00Z"),
      ISODate("2015-01-02T00:00:00Z"),
      ISODate("2015-01-03T00:00:00Z")
    ]
}

I would like to search the collection to see if there are any documents with the same carrier and flightNumber that also have dates in the dates array that over lap. For example:

{
   carrier: "abc",
   flightNumber: 123,
   dates: [
      ISODate("2015-01-01T00:00:00Z"),
      ISODate("2015-01-02T00:00:00Z"),
      ISODate("2015-01-03T00:00:00Z")
    ]
},
{
   carrier: "abc",
   flightNumber: 123,
   dates: [
      ISODate("2015-01-03T00:00:00Z"),
      ISODate("2015-01-04T00:00:00Z"),
      ISODate("2015-01-05T00:00:00Z")
    ]
}

If the above records were present in the collection I would like to return them because they both have carrier: abc, flightNumber: 123 and they also have the date ISODate("2015-01-03T00:00:00Z") in the dates array. If this date were not present in the second document then neither should be returned.

Typically I would do this by grouping and counting like below:

db.flights.aggregate([
  { 
    $group: { 
       _id: { carrier: "$carrier", flightNumber: "$flightNumber" }, 
       uniqueIds: { $addToSet: "$_id" },
       count: { $sum: 1 } 
    }
  }, 
  { 
    $match: { 
       count: { $gt: 1 } 
    }
  }
])

But I'm not sure how I could modify this to look for array overlap. Can anyone suggest how to achieve this?

Upvotes: 0

Views: 165

Answers (1)

Neil Lunn
Neil Lunn

Reputation: 151092

You $unwind the array if you want to look at the contents as "grouped" within them:

db.flights.aggregate([
  { "$unwind": "$dates" },
  { "$group": {
    "_id": { "carrier": "$carrier", "flightnumber": "$flightnumber", "date": "$dates" },
     "count": { "$sum": 1 },
     "_ids": { "$addToSet": "$_id" }
  }},
  { "$match": { "count": { "$gt": 1 } } },
  { "$unwind": "$_ids" },
  { "$group": { "_id": "$_ids" } }
])

That does in fact tell you which documents where the "overlap" resides, because the "same dates" along with the other same grouping key values that you are concerned about have a "count" which occurs more than once. Indicating the overlap.

Anything after the $match is really just for "presentation" as there is no point reporting the same _id value for multiple overlaps if you just want to see the overlaps. In fact if you want to see them together it would probably be best to leave the "grouped set" alone.

Now you could add a $lookup to that if retrieving the actual documents was important to you:

db.flights.aggregate([
  { "$unwind": "$dates" },
  { "$group": {
    "_id": { "carrier": "$carrier", "flightnumber": "$flightnumber", "date": "$dates" },
     "count": { "$sum": 1 },
     "_ids": { "$addToSet": "$_id" }
  }},
  { "$match": { "count": { "$gt": 1 } } },
  { "$unwind": "$_ids" },
  { "$group": { "_id": "$_ids" } },
  }},
  { "$lookup": {
    "from": "flights",
    "localField": "_id",
    "foreignField": "_id",
    "as": "_ids"
  }},
  { "$unwind": "$_ids" },
  { "$replaceRoot": {
    "newRoot": "$_ids"
  }}
])

And even do a $replaceRoot or $project to make it return the whole document. Or you could have even done $addToSet with $$ROOT if it was not a problem for size.

But the overall point is covered in the first three pipeline stages, or mostly in just the "first". If you want to work with arrays "across documents", then the primary operator is still $unwind.


Alternately for a more "reporting" like format:

db.flights.aggregate([
  { "$addFields": { "copy": "$$ROOT" } }, 
  { "$unwind": "$dates" },
  { "$group": {
    "_id": {
      "carrier": "$carrier",
      "flightNumber": "$flightNumber",
      "dates": "$dates" 
    },
    "count": { "$sum": 1 },
    "_docs": { "$addToSet": "$copy" }  
  }},
  { "$match": { "count": { "$gt": 1 } } },
  { "$group": {
    "_id": {
      "carrier": "$_id.carrier",
      "flightNumber": "$_id.flightNumber",
    },
    "overlaps": {
      "$push": {
        "date": "$_id.dates",
        "_docs": "$_docs"  
      }  
    }  
  }}
])

Which would report the overlapped dates within each group and tell you which documents contained the overlap:

{
    "_id" : {
        "carrier" : "abc",
        "flightNumber" : 123.0
    },
    "overlaps" : [ 
        {
            "date" : ISODate("2015-01-03T00:00:00.000Z"),
            "_docs" : [ 
                {
                    "_id" : ObjectId("5977f9187dcd6a5f6a9b4b97"),
                    "carrier" : "abc",
                    "flightNumber" : 123.0,
                    "dates" : [ 
                        ISODate("2015-01-03T00:00:00.000Z"), 
                        ISODate("2015-01-04T00:00:00.000Z"), 
                        ISODate("2015-01-05T00:00:00.000Z")
                    ]
                }, 
                {
                    "_id" : ObjectId("5977f9187dcd6a5f6a9b4b96"),
                    "carrier" : "abc",
                    "flightNumber" : 123.0,
                    "dates" : [ 
                        ISODate("2015-01-01T00:00:00.000Z"), 
                        ISODate("2015-01-02T00:00:00.000Z"), 
                        ISODate("2015-01-03T00:00:00.000Z")
                    ]
                }
            ]
        }
    ]
}

Upvotes: 2

Related Questions