expert
expert

Reputation: 30125

group in aggregate framework stopped working properly

I hate this kind of questions but maybe you can point me to obvious. I'm using Mongo 2.2.2.

I have a collection (in replica set) with 6M documents which has string field called username on which I have index. The index was non-unique but recently I made it unique. Suddenly following query gives me false alarms that I have duplicates.

db.users.aggregate(
    { $group : {_id : "$username", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : {total : -1} } );

which returns

{
        "result" : [
                {
                        "_id" : "davidbeges",
                        "total" : 2
                },
                {
                        "_id" : "jesusantonio",
                        "total" : 2
                },
                {
                        "_id" : "elesitasweet",
                        "total" : 2
                },
                {
                        "_id" : "theschoolofbmx",
                        "total" : 2
                },
                {
                        "_id" : "longflight",
                        "total" : 2
                },
                {
                        "_id" : "thenotoriouscma",
                        "total" : 2
                }
        ],
        "ok" : 1
}

I tested this query on sample collection with few documents and it works as expected.

Upvotes: 1

Views: 267

Answers (2)

Adam Comerford
Adam Comerford

Reputation: 21682

I think the answer may lie in the fact that your $group is not using an index, it's just doing a scan over the entire collection. These operators can use and index currently in the aggregation framework:

$match $sort $limit $skip

And they will work if placed before:

$project $unwind $group

However, $group by itself will not use an index. When you do your find() test I am betting you are using the index, possibly as a covered index (you can verify by looking at an explain() for that query), rather than scanning the collection. Basically my theory is that your index has no dupes, but your collection does.

Edit: This likely happens because a document is updated/moved during the aggregation operation and hence is seen twice, not because of dupes in the collection as originally thought.

If you add an operator earlier in the pipeline that can use the index but not alter the results fed into $group, then you can avoid the issue.

Upvotes: 0

expert
expert

Reputation: 30125

One of 10gen responded in their JIRA.

Are there any updates on this collection? If so, I'd try adding {$sort: {username:1}} to the front of the pipeline. That will ensure that you only see each username once if it is unique. If there are updates going on, it is possible that aggregation would see a document twice if it moves due to growth. Another possibility is that a document was deleted after being seen by the aggregation and a new one was inserted with the same username.

So sorting by username before grouping helped.

Upvotes: 2

Related Questions