Reputation: 5137
I have a MongoDB database with a collection called "ActiveTracking", which uses a custom key, "dates". Periodically, I get new documents in bulk that might contain both duplicate "dates" and new "dates".
My list of records looks like this:
dicto = [{'_id': Timestamp('2004-02-25 00:00:00'),
          'low': 2.809999942779541,
          'volume': 12800,
          'open': 2.9000000953674316,
          'high': 2.9000000953674316,
          'close': 2.819999933242798,
          'adjclose': 1.5342552661895752,
          'dividends': 0.0},
         {'_id': Timestamp('2004-02-26 00:00:00'),
          'low': 2.819999933242798,
          'volume': 59500,
          'open': 2.8499999046325684,
          'high': 2.9000000953674316,
          'close': 2.890000104904175,
          'adjclose': 1.572339653968811,
          'dividends': 0.0},]
For example, the first record is in the db, the second is not. If I do:
collection = db["STOCK"]
collection.insert_many(dicto, ordered=False)
it raises:
BulkWriteError: batch op errors occurred
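For context on that error: with ordered=False the server still attempts every document in the batch, so the non-duplicate ones are written even though insert_many raises afterwards. A minimal sketch of tolerating only the duplicate-key failures follows. The helper names are hypothetical, `collection` is assumed to be a pymongo Collection, and 11000 is MongoDB's duplicate-key error code.

```python
DUPLICATE_KEY = 11000  # MongoDB server error code for a duplicate key


def unexpected_write_errors(details):
    """Return the write errors that are NOT duplicate-key errors."""
    return [e for e in details.get("writeErrors", [])
            if e.get("code") != DUPLICATE_KEY]


def insert_skipping_duplicates(collection, docs):
    """Insert docs unordered; ignore duplicate _id errors, re-raise anything else.

    Hypothetical helper; `collection` is assumed to be a pymongo Collection.
    """
    # Imported here so the pure helper above can be used without pymongo installed.
    from pymongo.errors import BulkWriteError

    try:
        result = collection.insert_many(docs, ordered=False)
        return len(result.inserted_ids)
    except BulkWriteError as exc:
        if unexpected_write_errors(exc.details):
            raise  # something other than a duplicate went wrong
        return exc.details["nInserted"]  # number of documents actually written
```

The trade-off is that duplicates are detected by the server rather than filtered beforehand, so the whole batch is still sent over the wire.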
My collection has thousands of records, and the "bulk" I receive might contain hundreds (as opposed to the 2 I show in the example). Is there any way to write ONLY the unique ids to the db, in bulk?
Update: The following code might work, but I am trying to avoid iterating over the list to be inserted (to check for duplicates) before inserting. I would prefer a solution that does not iterate over a long list to decide what to insert, since that could be time-consuming.
to_be_inserted = []
for d in dicto:
    # find_one returns None when no document matches
    x = collection.find_one(d)
    if x is None:
        to_be_inserted.append(d)
# insert_many rejects an empty list
if to_be_inserted:
    collection.insert_many(to_be_inserted)
Upvotes: 0
Views: 296
Reputation: 1348
The following pseudocode should work. Check for the existence of the record using find_one and if the record is not present add it to a to_be_inserted list. Insert all in a batch at the end.
As _id is always indexed by default you will get very fast performance on your find_one.
If you know some of the properties of your timestamps you may be able to optimise further by keeping track of your oldest and newest timestamp and seeing if incoming timestamps are within or outside that range.
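The range idea could be sketched roughly like this, assuming you already know the oldest and newest _id stored in the collection (how you track them is up to you). The function name is hypothetical. Anything outside the range cannot be in the db yet, so only the rest needs a lookup.

```python
from datetime import datetime


def split_by_known_range(candidates, oldest, newest):
    """Split candidates into (certainly_new, needs_lookup).

    Assumes `oldest` and `newest` are the smallest and largest _id values
    currently stored, and that the incoming batch has no internal duplicates.
    """
    certainly_new, needs_lookup = [], []
    for doc in candidates:
        if doc["_id"] < oldest or doc["_id"] > newest:
            certainly_new.append(doc)   # outside the stored range: cannot exist yet
        else:
            needs_lookup.append(doc)    # inside the range: still has to be checked
    return certainly_new, needs_lookup


# Example with made-up dates
incoming = [{"_id": datetime(2003, 6, 1)},
            {"_id": datetime(2004, 2, 25)},
            {"_id": datetime(2005, 1, 3)}]
certainly_new, needs_lookup = split_by_known_range(
    incoming, datetime(2004, 1, 1), datetime(2004, 12, 31))
```

If new data mostly arrives in date order, most of a batch should land in certainly_new and skip the lookup entirely.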
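to_be_inserted = []
for d in candidate_records:
    # passing a bare value to find_one queries on _id, which is indexed
    if col.find_one(d["_id"]):
        continue
    else:
        to_be_inserted.append(d)
# check the batch, not the last record; insert_many rejects an empty list
if len(to_be_inserted) > 0:
    col.insert_many(to_be_inserted)
    to_be_inserted = []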
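A variant of the same check that asks the server for all existing ids in one query, instead of one find_one per record, might look like the sketch below. The function name is hypothetical and `collection` is assumed to be a pymongo Collection. It still loops over the batch in memory, but makes only one round trip for the lookup.

```python
def insert_missing(collection, docs):
    """Insert only the docs whose _id is not already stored; return how many.

    Hypothetical helper; assumes the stored _id values compare and hash
    equal to the incoming ones.
    """
    ids = [d["_id"] for d in docs]
    # One query: fetch only the _id field of documents that already exist
    existing = {row["_id"]
                for row in collection.find({"_id": {"$in": ids}}, {"_id": 1})}
    new_docs = [d for d in docs if d["_id"] not in existing]
    if new_docs:  # insert_many rejects an empty list
        collection.insert_many(new_docs, ordered=False)
    return len(new_docs)
```

Note that another writer could insert the same _id between the find and the insert_many, in which case the insert can still raise a duplicate-key error.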
Upvotes: 1