RRM

Reputation: 2679

Mongodb: ignore large documents ( BSON > 16 MB) during collection.aggregate()

I'm scanning a MongoDB collection that has large documents whose BSON exceeds 16 MB. Depending on a flag for random sampling, I call one of the following two aggregations:

documents = collection.aggregate(
                [{"$sample": {"size": sample_size}}], allowDiskUse=True)

OR

documents = collection.aggregate(
                [{"$limit": sample_size}], allowDiskUse=True)

sample_size is a parameter here.

The issue is that this command hangs for minutes on the large BSON documents, and eventually MongoDB aborts the execution, so my scan of the entire collection never completes.

Is there a way to tell mongodb to skip/ignore documents having size larger than a threshold?

For those who think that MongoDB cannot store values larger than 16 MB, here is the error message from a metadata collector (LinkedIn DataHub):

OperationFailure: BSONObj size: 17375986 (0x10922F2) is invalid. 
Size must be between 0 and 16793600(16MB) First element: _id: "Topic XYZ",
full error: {'operationTime': Timestamp(1634531126, 2), 'ok': 0.0, 'errmsg': 'BSONObj size: 17375986 (0x10922F2) is invalid. Size must be between 0 and 16793600(16MB) 

Upvotes: 3

Views: 1166

Answers (1)

Takis

Reputation: 8695

The maximum document size is 16 MB (the exception is the GridFS specification, which splits files into smaller chunks).

In your collection, each document should already be < 16 MB; MongoDB doesn't allow us to store bigger documents.

If you want to filter, let's say to documents < 10 MB, you can use the "$bsonSize" operator (MongoDB 4.4+) to compute each document's size server-side and filter out the big ones.
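A minimal sketch of that filter, reusing the question's `collection` and `sample_size` (the `MAX_BYTES` threshold and `build_pipeline` helper are my own names, not anything from MongoDB or PyMongo):

```python
# Prepend a $match stage that drops documents whose BSON size,
# computed server-side via $bsonSize (MongoDB 4.4+), exceeds a threshold.
MAX_BYTES = 10 * 1024 * 1024  # skip documents at or above 10 MB

def build_pipeline(sample_size: int, max_bytes: int = MAX_BYTES) -> list:
    """Size filter first, then the random sample from the question."""
    return [
        {"$match": {"$expr": {"$lt": [{"$bsonSize": "$$ROOT"}, max_bytes]}}},
        {"$sample": {"size": sample_size}},
    ]

# Usage, with `collection` being a pymongo Collection as in the question:
# documents = collection.aggregate(build_pipeline(sample_size),
#                                  allowDiskUse=True)
```

The same `$match` stage works in front of the `$limit` variant as well; `$expr` is needed because `$bsonSize` is an aggregation expression, not a plain query operator.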

Upvotes: 1
