John Smith

Reputation: 495

MongoDB Randomly Aggregate Documents (Unique Results)

I've read that one can use db.collection.aggregate with $sample to get random documents from a collection. But I've also read that $sample is NOT 100% reliable (it can return duplicates), so I wrote this query:

db.blog.aggregate([
   { "$sample": { "size": 100 } },
   { "$group": { "_id": "$post_id", "post": { "$push": "$$ROOT" } } }
])

Yes, I am attempting to group, but the issue is that in a loop it becomes more complicated than it should be, i.e. when iterating over the results from MongoDB.
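For example (a minimal sketch in the mongo shell, reusing the pipeline above), each result wraps the actual document inside a post array, so the loop has to unwrap it first:

db.blog.aggregate([
   { "$sample": { "size": 100 } },
   { "$group": { "_id": "$post_id", "post": { "$push": "$$ROOT" } } }
]).forEach(function (doc) {
   // the actual blog document is nested one level down
   var post = doc.post[0];
   printjson(post);
});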

Any suggestions are appreciated, thanks in advance.

EDIT: I want to know whether grouping is necessary to get unique results, or whether there is a better way of doing this. It does NOT make sense to have to use $group just for aggregate to return several random documents from MongoDB that are unique and not duplicates.

YES, I did set a unique index on the ID in the actual collection.

Upvotes: 1

Views: 1879

Answers (2)

Rajat Goel

Reputation: 2305

If you have a unique index on the post_id field, then there is no need for a $group operation after sampling.

Refer: https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#faq-developers-isolate-cursors
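In that case (a minimal sketch in the mongo shell, assuming a blog collection with a unique index on post_id) the pipeline reduces to the $sample stage alone, and each result is already a plain document:

db.blog.aggregate([
   { "$sample": { "size": 100 } }
]).forEach(function (post) {
   // each result is an ordinary blog document, no unwrapping needed
   printjson(post.post_id);
});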

Upvotes: 4

Tom Slabbaert

Reputation: 22296

OK, let's begin by clarifying the $sample uniqueness issue, as it's not as straightforward as you might think.

First, let's look at the $sample conditions as specified in the docs:

  1. $sample is the first stage of the pipeline

  2. N is less than 5% of the total documents in the collection

  3. The collection contains more than 100 documents

If any of these conditions are not met, Mongo will perform a collection scan followed by a random sort to select the N documents (in this case no duplicates can occur).

Assuming these conditions ARE met, duplicate ids can occur because of the lack of cursor isolation: if update/delete operations run on the collection while the pseudo-random cursor is being read, the same document may be returned more than once.

So assuming you're in this final case and your collection is being updated while you're querying it, grouping is your best shot if you want to be 100% sure that no duplicates are returned. (With that said, grouping 100 documents is a small enough overhead not to worry about.)
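For completeness, here is a sketch of such a de-duplicating pipeline (using $first and $replaceRoot, the latter available from MongoDB 3.4; not necessarily the only way to write it) that also avoids the nested post array when looping:

db.blog.aggregate([
   { "$sample": { "size": 100 } },
   // keep a single document per post_id in case $sample returned a duplicate
   { "$group": { "_id": "$post_id", "post": { "$first": "$$ROOT" } } },
   // promote the stored document back to the top level so each result
   // looks like an ordinary blog document again
   { "$replaceRoot": { "newRoot": "$post" } }
])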

Upvotes: 2
