floatingpurr

Reputation: 8599

PyMongo Bulk does not return after inserting 12k documents

I'm using pymongo 3.4 on macOS to bulk insert 12k big documents. Each doc is a piece of a time series with 365 values, so it's quite huge. I'm doing something like this:

bulk = db.test.initialize_unordered_bulk_op()
for i in range(1,12000):
  bulk.insert(TimeSeries.getDict(i))
bulk.execute()

The problem is that bulk.execute() does not return. Is there a kind of performance problem or a dimensional constraint?

Upvotes: 1

Views: 291

Answers (1)

chridam

Reputation: 103445

Consider putting your bulk insert operations into manageable batches of, say, 500. Write commands can accept no more than 1000 operations (from the docs), so you will have to split the bulk operations into multiple batches; here you can choose an arbitrary batch size of up to 1000.

The reason for choosing 500 is to ensure that the total size of the documents queued via Bulk.insert() stays less than or equal to the maximum BSON document size, since there is no guarantee that the default 1000 operation requests will fit under the 16MB BSON limit. The Bulk() operations in the mongo shell and comparable methods in the drivers do not have this limit, though.

Doing the math, you'd want to be sure those 500 insert operation requests do not themselves create a BSON document greater than 16MB, i.e. for an input document with 365 values you need to determine the scale factor that keeps the total size of the batch at 16MB or less. It seems reasonable that 365x500 values would fit under 16MB, unlike 365x12000:

bulk = db.test.initialize_unordered_bulk_op()
counter = 0

for i in range(1, 12000):
    # queue the insert in the current batch
    bulk.insert(TimeSeries.getDict(i))
    counter += 1

    # every 500 queued inserts, execute the batch and start a new one
    if counter % 500 == 0:
        bulk.execute()
        bulk = db.test.initialize_unordered_bulk_op()

# flush any remaining inserts in the final, partially filled batch
if counter % 500 != 0:
    bulk.execute()
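If you'd rather measure than guess, you can check the encoded size of a representative document and derive a batch size from it. A minimal sketch, assuming TimeSeries.getDict(i) returns a plain dict (as in the question) and using the bson package that ships with PyMongo:

    from bson import BSON

    # Measure one representative document's BSON size
    # (assumes TimeSeries.getDict returns a dict PyMongo can encode).
    sample = TimeSeries.getDict(1)
    doc_size = len(BSON.encode(sample))

    # Leave some headroom under the 16MB BSON limit and cap at 1000 ops.
    max_batch_bytes = 15 * 1024 * 1024
    batch_size = max(1, min(1000, max_batch_bytes // doc_size))
    print("approx. doc size: %d bytes, batch size: %d" % (doc_size, batch_size))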

--UPDATE--

Actually, the limit does not apply to the bulk API; instead:

If a group exceeds this limit, MongoDB will divide the group into smaller groups of 1000 or less.

Thanks to @Styvane for pointing this out.
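In other words, you can hand the driver all the documents at once and let it split them into groups for you. As a side note, a rough sketch of the non-deprecated equivalent using insert_many (assuming the same hypothetical TimeSeries.getDict helper as in the question):

    # insert_many with ordered=False behaves like an unordered bulk insert;
    # PyMongo splits the documents into server-sized batches internally.
    docs = (TimeSeries.getDict(i) for i in range(1, 12000))
    result = db.test.insert_many(docs, ordered=False)
    print("inserted %d documents" % len(result.inserted_ids))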

Upvotes: 2
