MongoDB - Remove duplicates in query

Question

I have to work with MongoDB for my job, but I'm not very comfortable with it. I have to gather some documents and remove duplicates according to one field.

Here is a (very, very) simplified structure of a document:

{
    'user': 'The User',
    'report': {
        'id': 0,
        ...
    }
}

A user can have several reports, including several identical ones (this is not a design mistake; it only looks strange because the structure is simplified).

A report is only related to one user.

I would like to retrieve a set of user-report pairs with all duplicate report IDs removed. Here is an example:

# Data

User    | Report ID
--------|----------
User1   | AAAA
User1   | AAAA
User1   | BBBB
User2   | CCCC
User3   | DDDD
User3   | DDDD

# Expected output, where each line represents a document

User    | Report ID
--------|----------
User1   | AAAA
User1   | BBBB
User2   | CCCC
User3   | DDDD

I am really confused by all the aggregation operators. How can I do this?

c1moore · Accepted Answer

This is pretty straightforward using the $group stage in an aggregation pipeline.

First, my sample data:

[
    { 'user': 'User1', report: { id: 'AAAA' } },
    { 'user': 'User1', report: { id: 'BBBB' } },
    { 'user': 'User1', report: { id: 'AAAA' } },
    { 'user': 'User2', report: { id: 'CCCC' } },
    { 'user': 'User3', report: { id: 'DDDD' } },
    { 'user': 'User3', report: { id: 'DDDD' } }
]

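To get the same expected format you posted, you can execute the following query:

db.reports.aggregate([
    {
        // Collapse all documents that share the same report.id into one group.
        $group: {
            _id: '$report.id',
            // Keep the user from the first document seen in each group.
            user: {
                $first: '$user'
            }
        }
    },
    {
        // Rename the fields to match the expected output and hide _id.
        $project: {
            _id: 0,
            User: '$user',
            Report: '$_id'
        }
    }
])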

The first stage of this aggregation pipeline groups all of the documents in your collection by report.id. Note the dot notation used to reference a field in the embedded document. It also carries the user field through by taking its value from the first document MongoDB encounters with that report ID. You mention that a report is related to only one user, so it doesn't matter which document in a group supplies that value.

The second stage of this aggregation pipeline just renames the fields to the names you used in your expected format. The $group stage sets the _id field of its output to the value you grouped by (in this case, report.id). The $project stage copies that value into the Report field and suppresses _id.
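
If it helps to see what the two stages are doing without a database at hand, here is a rough plain-JavaScript sketch of the same logic run against the sample data above. It is only an illustration of the idea, not how MongoDB implements it, and the helper name dedupeByReportId is made up for this example.

```javascript
// Plain-JavaScript sketch of the two pipeline stages; no MongoDB required.
// dedupeByReportId is a hypothetical helper, used for illustration only.
const docs = [
    { user: 'User1', report: { id: 'AAAA' } },
    { user: 'User1', report: { id: 'BBBB' } },
    { user: 'User1', report: { id: 'AAAA' } },
    { user: 'User2', report: { id: 'CCCC' } },
    { user: 'User3', report: { id: 'DDDD' } },
    { user: 'User3', report: { id: 'DDDD' } }
];

function dedupeByReportId(documents) {
    // Stage 1 (like $group with $first): one entry per report.id,
    // keeping the user from the first document seen with that ID.
    const groups = new Map();
    for (const doc of documents) {
        const key = doc.report.id;
        if (!groups.has(key)) {
            groups.set(key, { _id: key, user: doc.user });
        }
    }
    // Stage 2 (like $project): rename the fields and drop _id.
    return Array.from(groups.values(), g => ({ User: g.user, Report: g._id }));
}

const result = dedupeByReportId(docs);
console.log(result);
```

One difference to keep in mind: this sketch returns the groups in the order they were first seen, whereas $group does not guarantee any particular output order. If the order matters, you would probably want to add a $sort stage after the $group.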
