Luis Miguel

Reputation: 5137

pymongo: inserting many documents into a MongoDB collection that might contain duplicates

I have a MongoDB database with a collection called "ActiveTracking", whose custom key is a date. Periodically, I receive new documents in bulk that may contain both duplicate dates and new dates.

My list of records looks like this:

dicto = [{'_id': Timestamp('2004-02-25 00:00:00'),
  'low': 2.809999942779541,
  'volume': 12800,
  'open': 2.9000000953674316,
  'high': 2.9000000953674316,
  'close': 2.819999933242798,
  'adjclose': 1.5342552661895752,
  'dividends': 0.0},
 {'_id': Timestamp('2004-02-26 00:00:00'),
  'low': 2.819999933242798,
  'volume': 59500,
  'open': 2.8499999046325684,
  'high': 2.9000000953674316,
  'close': 2.890000104904175,
  'adjclose': 1.572339653968811,
  'dividends': 0.0},]

For example, the first record is already in the db and the second is not. If I run:

collection = db["STOCK"]
collection.insert_many(dicto, ordered=False)

it raises

BulkWriteError: batch op errors occurred

My collection has thousands of records, and the "bulk" I receive might contain hundreds of documents (as opposed to the 2 I show in the example). Is there any way to write ONLY the documents with unique ids to the db, in bulk?

Update: The following code works, but I am trying to avoid iterating over the list of documents to check for duplicates before inserting. I would prefer a solution that does not walk a long list to decide what to insert, since that could be time consuming.

to_be_inserted = []
for d in dicto:
    # find_one returns None when no document matches the given _id
    if collection.find_one({'_id': d['_id']}) is None:
        to_be_inserted.append(d)
collection.insert_many(to_be_inserted)

Upvotes: 0

Views: 296

Answers (2)

Joe Drumgoole

Reputation: 1348

The following code should work. Check for the existence of each record using find_one and, if the record is not present, add it to a to_be_inserted list. Insert them all in one batch at the end.

As _id is always indexed by default, the find_one lookups will be very fast.

If you know something about the properties of your timestamps, you may be able to optimise further by keeping track of your oldest and newest stored timestamps and checking whether each incoming timestamp falls inside or outside that range (see the sketch after the code below).

to_be_inserted = []
for d in candidate_records:
    # find_one with a bare value queries by _id; a hit means the record already exists
    if col.find_one(d["_id"]):
        continue
    else:
        to_be_inserted.append(d)

# Insert everything that was not found, in a single batch
if len(to_be_inserted) > 0:
    col.insert_many(to_be_inserted)
    to_be_inserted = []
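
A minimal sketch of the range optimisation mentioned above, assuming the stored _id timestamps are comparable and the collection grows forward in time (variable names here are hypothetical, reusing col and candidate_records from the answer): fetch the newest stored _id once, then only documents at or before that point need a find_one lookup; anything newer can be inserted without checking.

# One query for the newest stored _id (None if the collection is empty)
newest = col.find_one(sort=[("_id", -1)])
newest_id = newest["_id"] if newest else None

to_be_inserted = []
for d in candidate_records:
    if newest_id is not None and d["_id"] <= newest_id:
        # Falls inside the range already covered by the collection,
        # so it might be a duplicate: check before keeping it
        if col.find_one(d["_id"]):
            continue
    to_be_inserted.append(d)

if to_be_inserted:
    col.insert_many(to_be_inserted)

With this, only candidates that overlap the existing date range cost a round trip each; genuinely new dates are appended with no per-document lookup.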

Upvotes: 1

D. SM

Reputation: 14530

This is called an unordered bulk write: pass ordered=False to insert_many and MongoDB attempts every document, reporting the duplicates as errors without aborting the rest of the batch.
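
A minimal sketch of that approach, assuming duplicates only ever collide on _id: do the unordered insert in one call, catch the resulting BulkWriteError, and ignore the duplicate-key errors (code 11000) while re-raising anything else. The connection details and database name here are assumptions.

from pymongo import MongoClient
from pymongo.errors import BulkWriteError

client = MongoClient()              # assumed connection settings
collection = client["mydb"]["STOCK"]

def insert_ignore_duplicates(docs):
    """Insert docs in one bulk call; documents whose _id already
    exists are skipped, any other write error is re-raised."""
    try:
        collection.insert_many(docs, ordered=False)
    except BulkWriteError as exc:
        # 11000 = duplicate key error; anything else is a real problem
        real_errors = [e for e in exc.details["writeErrors"]
                       if e["code"] != 11000]
        if real_errors:
            raise

Calling insert_ignore_duplicates(dicto) writes the new documents in a single round trip and silently skips the ones already present, with no per-document find_one.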

Upvotes: 0
