MongoDB $redact to filter out some elements of an array

Question

I am trying to formulate a query over the sample bios collection http://docs.mongodb.org/manual/reference/bios-example-collection/:

Retrieve all persons and their awards that they received before receiving a Turing award.

I have come up with this query:

db.bios.aggregate([
    {$match: {"awards.award" : "Turing Award"}},
    {$project: {"award1": "$awards", "award2": "$awards", "first_name": "$name.first", "last_name": "$name.last"}},
    {$unwind: "$award1"},
    {$match: {"award1.award" : "Turing Award"}},
    {$unwind: "$award2"},
    {$redact: {
        $cond: {
           if: { $eq: [ { $gt: [ "$award1.year", "$award2.year"] }, true]},
           then: "$$KEEP",
           else: "$$PRUNE"
           }
        }
    }
])

And this is the answer:

/* 0 */
{
    "result" : [ 
    {
        "_id" : 1,
        "award1" : {
            "award" : "Turing Award",
            "year" : 1977,
            "by" : "ACM"
        },
        "award2" : {
            "award" : "W.W. McDowell Award",
            "year" : 1967,
            "by" : "IEEE Computer Society"
        },
        "first_name" : "John",
        "last_name" : "Backus"
    }, 
    {
        "_id" : 1,
        "award1" : {
            "award" : "Turing Award",
            "year" : 1977,
            "by" : "ACM"
        },
        "award2" : {
            "award" : "National Medal of Science",
            "year" : 1975,
            "by" : "National Science Foundation"
        },
        "first_name" : "John",
        "last_name" : "Backus"
    }, 
    {
        "_id" : 4,
        "award1" : {
            "award" : "Turing Award",
            "year" : 2001,
            "by" : "ACM"
        },
        "award2" : {
            "award" : "Rosing Prize",
            "year" : 1999,
            "by" : "Norwegian Data Association"
        },
        "first_name" : "Kristen",
        "last_name" : "Nygaard"
    }, 
    {
        "_id" : 5,
        "award1" : {
            "award" : "Turing Award",
            "year" : 2001,
            "by" : "ACM"
        },
        "award2" : {
            "award" : "Rosing Prize",
            "year" : 1999,
            "by" : "Norwegian Data Association"
        },
        "first_name" : "Ole-Johan",
        "last_name" : "Dahl"
    }
],
"ok" : 1
}

What I don't like about this solution is that I unwind $award2. Instead, I would be happy to keep award2 as an array, and only remove those awards that were received after award1. So, for instance, the answer for John Backus should be:

{
    "_id" : 1,
    "first_name" : "John",
    "last_name" : "Backus",
    "award1" : {
        "award" : "Turing Award",
        "year" : 1977,
        "by" : "ACM"
    },
    "award2" : [ 
        {
            "award" : "W.W. McDowell Award",
            "year" : 1967,
            "by" : "IEEE Computer Society"
        }, 
        {
            "award" : "National Medal of Science",
            "year" : 1975,
            "by" : "National Science Foundation"
        }
    ]
}

Is it possible to achieve it with $redact without doing $unwind: "$award2"?

Blakes Seven · Accepted Answer

It might have been a little more helpful if you had included the original state of the document as an example in your question as this clearly shows "where you are coming from" and then to "where you want to get to" as a goal in addition to your desired output as given.

That's just a tip, but it seems that you are starting with a document like this:

{
    "_id" : 1,
    "name": { 
        "first" : "John",
        "last" : "Backus"
    },
    "awards" : [
        {
            "award" : "W.W. McDowell Award",
            "year" : 1967,
            "by" : "IEEE Computer Society"
        }, 
        {
            "award" : "National Medal of Science",
            "year" : 1975,
            "by" : "National Science Foundation"
        },
        { 
            "award" : "Turing Award",
            "year" : 1977,
            "by" : "ACM"
        },
        {
            "award" : "Some other award",
            "year" : 1979,
            "by" : "Someone Else"
        }
    ]
}

So the real points here is that while you may haved reached for $redact here (and it is a bit nicer than using $project for a logical condition and then using $match to filter that logical match ) this probably isn't the best tool for the comparison you want to do here.

Before moving on I just want to point out the main problem here with $redact. Whatever you do here the logic ( without an unwind ) would be essentially to compare "directly" on $$DESCEND in order to process the the array elements on the value of "year" at whatever level.

That recursion is going to invalidate the "award1" condition as well since it has the same field name. Even renaming that field kills the logic since a projected value where it was missing would not be greater than the tested value.

In a nutshell, $redact is ruled right out since you cannot say "take from here only" with the logic it applies.

The alternate is to use $map and $setDifference to filter contents from the arrays as follows:

db.bios.aggregate([
    { "$match": { "awards.award": "Turing Award" } },
    { "$project": {
        "first_name": "$name.first",
        "last_name": "$name.last",
        "award1": { "$setDifference": [
            { "$map": {
                "input": "$awards",
                "as": "a",
                "in": { "$cond": [
                    { "$eq": [ "$$a.award", "Turing Award" ] },
                    "$$a",
                    false
                ]}
            }},
            [false]
        ]},
        "award2": { "$setDifference": [
            { "$map": {
                "input": "$awards",
                "as": "a",
                "in": { "$cond": [
                    { "$ne": [ "$$a.award", "Turing Award" ] },
                    "$$a",
                    false
                ]}
            }},
            [false]
        ]}
    }},
    { "$unwind": "$award1" },
    { "$project": {
        "first_name": 1,
        "last_name": 1,
        "award1": 1,
        "award2": { "$setDifference": [
            { "$map": {
                "input": "$award2",
                "as": "a",
                "in": { "$cond": [
                     { "$gt": [ "$award1.year", "$$a.year" ] },
                     "$$a",
                     false
                 ]}
            }},
            [false]            
        ]}
    }}
])

And there really is no "pretty" way of getting around either the usage of $unwind in the itermediatary stage or even the second $project here, since $map ( and the $setDifference filter ) returns what is "still an array". So the $unwind is necessary to make the "array" a singular ( provided your condition only matches 1 element ) entry for which to use in comparison.

Trying to "squish" all the logic in a single $project will only result in "arrays of arrays" in the second output, and still some "unwinding" therefore required, but at least this way unwinding the (hopefully) 1 match is not really that costly and keeps the output clean.

But the other thing to really note here is that you are not really "aggregating" anything here at all. This is just document manipulation, so you might well consider to just do that manipulation directly in client code. As demonstrated with this shell example:

db.bios.find(
    { "awards.award": "Turing Award" },
    { "name": 1, "awards": 1 }
).forEach(function(doc) {
    doc.first_name = doc.name.first;
    doc.last_name = doc.name.last;
    doc.award1 = doc.awards.filter(function(award) {
        return award.award == "Turing Award"
    })[0];
    doc.award2 = doc.awards.filter(function(award) {
        return doc.award1.year > award.year;
    });
    delete doc.name;
    delete doc.awards;
    printjson(doc);
})

At any rate, both approaches will output the same:

{
    "_id" : 1,
    "first_name" : "John",
    "last_name" : "Backus",
    "award1" : {
            "award" : "Turing Award",
            "year" : 1977,
            "by" : "ACM"
    },
    "award2" : [
            {
                    "award" : "W.W. McDowell Award",
                    "year" : 1967,
                    "by" : "IEEE Computer Society"
            },
            {
                    "award" : "National Medal of Science",
                    "year" : 1975,
                    "by" : "National Science Foundation"
            }
    ]
}

The only real difference here is that by using .aggregate() the content of "award2" will already be filtered when returned from the server, which probably isn't going to be that much different from doing the client processing approach unless the items that would be removed comprises a reasonably large list per document.

For the record, the only alteration to your existing aggregation pipeline really required here would be to add a $group to the end to "re-combine" the array entries into a single document:

db.bios.aggregate([
    { "$match": { "awards.award": "Turing Award" } },
    { "$project": {
        "first_name": "$name.first", 
        "last_name": "$name.last",
        "award1": "$awards",
        "award2": "$awards"
    }},
    { "$unwind": "$award1" },
    { "$match": {"award1.award" : "Turing Award" }},
    { "$unwind": "$award2" },
    { "$redact": {
        "$cond": {
             "if": { "$gt": [ "$award1.year", "$award2.year"] },
             "then": "$$KEEP",
             "else": "$$PRUNE"
        }
    }},
    { "$group": {
        "_id": "$_id",
        "first_name": { "$first": "$first_name" },
        "last_name": { "$first": "$last_name" },
        "award1": { "$first": "$award1" },
        "award2": { "$push": "$award2" }
    }}
])

But then again, there is all that "array duplication" and the "cost of unwind" associated with all the operations here. So either of the first two approaches is what you really want in order to avoid that.

MongoDB $redact to filter out some elements of an array

Answers (2)

Related Questions