How to return distinct $or in mongodb?

Question

So I have this query

db.collection.find({ $or: [ { data_id: 123 }, { data_id: 345 }, { data_id: 443 } ] });

How do I tweak it to return only one document for each part of the $or? I.e. something analogous to the SQL:

SELECT DISTINCT data_id, [...] WHERE data_id='123' OR data_id='345'...

Neil Lunn · Accepted Answer

The answer depends on the documents you have, because "distinct" can mean a few different things here. Consider the following sample:

{
    "tripId": 123,
    "thisField": "this",
    "thatField": "that"
},
{
    "tripId": 123,
    "thisField": "other",
    "thatField": "then"
},
{
    "tripId": 345,
    "thisField": "other",
    "thatField": "then"
},
{
    "tripId": 345,
    "thisField": "this",
    "thatField": "that"
},
{
    "tripId": 123,
    "thisField": "this",
    "thatField": "that"
},
{
    "tripId": 789,
    "thisField": "this",
    "thatField": "that"
}

MongoDB has the .distinct() method, which returns the distinct values of a single field. It works on only one field, and the results come back simply as an array of that field's values.

For anything else you want the .aggregate() method. This is the aggregation pipeline, which performs a number of different functions and can handle some very complex operations due to the "pipeline" nature of its processing.
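As a rough illustration of what .distinct() gives you, here is a minimal in-memory sketch in plain JavaScript. It does not talk to a database; `distinctValues` is a hypothetical helper that only mimics the shape of the result, and the shell call it stands in for is shown in the comment.

```javascript
// Sample documents from above (only tripId kept, for brevity).
const docs = [
  { tripId: 123 }, { tripId: 123 }, { tripId: 345 },
  { tripId: 345 }, { tripId: 123 }, { tripId: 789 }
];

// In-memory stand-in for the shell call:
//   db.collection.distinct("tripId", { "tripId": { "$in": [ 123, 345 ] } })
// `distinctValues` is a hypothetical helper, not a MongoDB API.
function distinctValues(documents, field, allowed) {
  const seen = new Set();
  for (const doc of documents) {
    // Keep only values that pass the "$in" style filter.
    if (allowed.includes(doc[field])) seen.add(doc[field]);
  }
  return [...seen];
}

const tripIds = distinctValues(docs, "tripId", [123, 345]);
console.log(tripIds); // a flat array of values, not documents
```

Note that the other fields of each document are not available in this kind of result, which is why the rest of this answer uses aggregation.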

Here you want a $group pipeline stage in order to "group" together values based on a key. That "key" is expressed in the form of an _id key in the $group statement. Much like "SELECT" in SQL with a "GROUP BY" or a "DISTINCT" modifier (which are much the same in function), you need to specify all of the fields you intend to have in the results.

Moreover, any field not named in the "GROUP BY" portion of a statement has to be subject to some sort of "grouping operation" that selects which value to present. For this there are various "Group Accumulator Operators" that act on these values.

One example, using the $first operator:

db.collection.aggregate([
    { "$match": {
        "tripId": { "$in": [ 123,345 ] }
    }},
    { "$group": {
        "_id": "$tripId",
        "thisField": { "$first": "$thisField" },
        "thatField": { "$first": "$thatField" },
        "total": { "$sum": 1 }
    }}
])

Gives this result:

{ "_id" : 345, "thisField" : "other", "thatField" : "then", "total" : 2 }
{ "_id" : 123, "thisField" : "this", "thatField" : "that", "total" : 3 }

So the $sum operator counts the occurrences of each distinct key, and the $first operator picks up the "first" occurrence of the value for each field named in an accumulator expression outside of the grouping key. Without a preceding $sort stage, "first" simply means the first document the stage receives for that key.
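To make the grouping logic concrete, here is a minimal in-memory sketch in plain JavaScript of what that pipeline does to the sample documents. It is only an illustration of the semantics, not how the server implements it, and `groupFirst` is a hypothetical helper name.

```javascript
const docs = [
  { tripId: 123, thisField: "this",  thatField: "that" },
  { tripId: 123, thisField: "other", thatField: "then" },
  { tripId: 345, thisField: "other", thatField: "then" },
  { tripId: 345, thisField: "this",  thatField: "that" },
  { tripId: 123, thisField: "this",  thatField: "that" },
  { tripId: 789, thisField: "this",  thatField: "that" }
];

// Hypothetical helper mimicking $match + $group with $first and $sum.
function groupFirst(documents, wantedIds) {
  const groups = new Map();
  for (const doc of documents) {
    if (!wantedIds.includes(doc.tripId)) continue; // "$match" with "$in"
    const group = groups.get(doc.tripId);
    if (group === undefined) {
      // First document seen for this key: "$first" keeps these values.
      groups.set(doc.tripId, {
        _id: doc.tripId,
        thisField: doc.thisField,
        thatField: doc.thatField,
        total: 1
      });
    } else {
      group.total += 1; // "$sum": 1
    }
  }
  return [...groups.values()];
}

const grouped = groupFirst(docs, [123, 345]);
console.log(grouped);
```

The order of the groups in this sketch follows the input order; the real $group stage makes no promise about output order unless you add a $sort after it.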

In versions of MongoDB since 2.6 you can "shortcut" naming all of the fields you want individually using the $$ROOT expression variable. This is a reference to "all" of the fields present in the document as of the state in the current stage where it is used. It's a little shorter to write, but the output is a little different due to the syntax:

db.collection.aggregate([
    { "$match": {
        "tripId": { "$in": [ 123,345 ] }
    }},
    { "$group": {
        "_id": "$tripId",
        "doc": { "$first": "$$ROOT" },
        "total": { "$sum": 1 }
    }}
])

Outputs as:

{
    "_id" : 345,
    "doc" : {
            "_id" : ObjectId("54feaf3839c29b9cd470bcbe"),
            "tripId" : 345,
            "thisField" : "other",
            "thatField" : "then"
    },
    "total" : 2
}
{
    "_id" : 123,
    "doc" : {
            "_id" : ObjectId("54feaf3839c29b9cd470bcbc"),
            "tripId" : 123,
            "thisField" : "this",
            "thatField" : "that"
    },
    "total" : 3
}

That is a general case with most $group aggregation operations where you specify a "key" and subject other fields present to a "grouping operator"/"accumulator" of some sort.

The other case is where you are looking for the "distinct" occurrences of "all" fields. Then you express these as part of the "key" for the group expression, like this:

db.collection.aggregate([
    { "$match": {
        "tripId": { "$in": [ 123,345 ] }
    }},
    { "$group": {
        "_id": {
            "tripId": "$tripId",
            "thisField": "$thisField",
            "thatField": "$thatField"
        },
        "total": { "$sum": 1 }
    }}
])

That gives us this output:

{
    "_id" : {
            "tripId" : 345,
            "thisField" : "this",
            "thatField" : "that"
    },
    "total" : 1
}
{
    "_id" : {
            "tripId" : 345,
            "thisField" : "other",
            "thatField" : "then"
    },
    "total" : 1
}
{
    "_id" : {
            "tripId" : 123,
            "thisField" : "other",
            "thatField" : "then"
    },
    "total" : 1
}
{
    "_id" : {
            "tripId" : 123,
            "thisField" : "this",
            "thatField" : "that"
    },
    "total" : 2
}

The result is 4 documents, one for each "distinct" combination of values across the fields in the "composite key". It correctly reports that most of those combinations occurred once, with the exception of the one combination that occurs twice with all the same values.

Naturally the $$ROOT variable would not apply here directly, as the "whole document" contains the "unique" _id field for each document. You can always add a $project stage beforehand to filter that field out, but the same requirement to specify the fields you need still applies:
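The composite-key idea can be sketched in memory as well. This plain JavaScript example assumes the same sample documents and uses a hypothetical helper, `groupByAllFields`, that builds one group per distinct combination of the three fields.

```javascript
const docs = [
  { tripId: 123, thisField: "this",  thatField: "that" },
  { tripId: 123, thisField: "other", thatField: "then" },
  { tripId: 345, thisField: "other", thatField: "then" },
  { tripId: 345, thisField: "this",  thatField: "that" },
  { tripId: 123, thisField: "this",  thatField: "that" },
  { tripId: 789, thisField: "this",  thatField: "that" }
];

// Hypothetical helper mimicking $match + $group on a composite _id.
function groupByAllFields(documents, wantedIds) {
  const groups = new Map();
  for (const { tripId, thisField, thatField } of documents) {
    if (!wantedIds.includes(tripId)) continue; // "$match" with "$in"
    // Serialize the field combination so it can act as a Map key.
    const key = JSON.stringify([tripId, thisField, thatField]);
    const group = groups.get(key);
    if (group === undefined) {
      groups.set(key, { _id: { tripId, thisField, thatField }, total: 1 });
    } else {
      group.total += 1; // "$sum": 1
    }
  }
  return [...groups.values()];
}

const combos = groupByAllFields(docs, [123, 345]);
console.log(combos.length); // 4
```

Every field you want treated as part of "distinct" has to appear in the key; leaving one out merges combinations that differ only in that field.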

db.collection.aggregate([
    { "$match": {
        "tripId": { "$in": [ 123,345 ] }
    }},
    { "$project": {
        "_id": 0,
        "tripId": 1,
        "thisField": 1,
        "thatField": 1
    }},
    { "$group": {
        "_id": "$$ROOT",
        "total": { "$sum": 1 }
    }}
])

So that serves as an introduction with examples of what you can do in the form of "distinct" queries with MongoDB and specifically the aggregation framework. There are various other common SQL to Aggregation mapping examples given in the documentation.

The other general point is your usage of $or in your question. As you see in the samples here, when you want the same "or" condition over values of the same field, the more efficient way to write this in your query is with the $in operator. Rather than an array of "query documents", this takes an array of "possible values" for the common field it is examining in the expression. It is basically an $or condition, but expressed in a shorter form for this case.
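To show that the two query shapes select the same documents, here is a small in-memory sketch in plain JavaScript. The `matchesOr` and `matchesIn` functions are hypothetical, deliberately minimal matchers that handle only these two shapes; they are not MongoDB's query engine.

```javascript
const docs = [
  { tripId: 123, thisField: "this" },
  { tripId: 123, thisField: "other" },
  { tripId: 345, thisField: "other" },
  { tripId: 345, thisField: "this" },
  { tripId: 123, thisField: "this" },
  { tripId: 789, thisField: "this" }
];

// The same condition written both ways.
const orQuery = { $or: [{ tripId: 123 }, { tripId: 345 }] };
const inQuery = { tripId: { $in: [123, 345] } };

// Hypothetical matcher: true if any sub-document of $or matches exactly.
function matchesOr(doc, query) {
  return query.$or.some(cond =>
    Object.entries(cond).every(([field, value]) => doc[field] === value));
}

// Hypothetical matcher: true if every field's value is in its $in list.
function matchesIn(doc, query) {
  return Object.entries(query).every(([field, cond]) =>
    cond.$in.includes(doc[field]));
}

const viaOr = docs.filter(doc => matchesOr(doc, orQuery));
const viaIn = docs.filter(doc => matchesIn(doc, inQuery));
console.log(viaOr.length, viaIn.length); // 5 5
```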
