Reputation: 25639
How do I bulk update/insert in MongoDB with pymongo/pandas?
The error I get is batch op errors occurred. The reason I get it is because I set the "_id", which I want to do. The code runs fine on the first run, but on the second run it fails. I want to use pandas in the workflow. The data does have datetime objects.
The syntax is completely different for upsert=True with update. An efficient solution with update would be helpful, where "_id" or "qid" could be set. But there are Python datetime objects!
import json
import pandas as pd

# cxn is an existing SQL connection; db is an existing pymongo database
InSQL = 'SELECT * FROM database2.table2'
sqlOut = pd.read_sql(InSQL, cxn)
sqlOut['_id'] = "20170101" + ":" + sqlOut['Var']
dfOut = sqlOut.to_json(orient='records', date_format='iso')
try:
    db["test"].insert_many(json.loads(dfOut))
except Exception as e:
    print(e)
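(Note: to_json(date_format='iso') serializes the datetime values to ISO strings, so the documents land in Mongo with string dates rather than date objects. A minimal sketch that keeps real datetimes, assuming the datetime column survives pd.read_sql as datetime64, skips the JSON round-trip and uses to_dict instead; pandas Timestamps subclass datetime.datetime, which pymongo can encode:)

# Sketch (assumption): keep datetimes as datetime objects instead of ISO strings
records = sqlOut.to_dict(orient='records')  # Timestamp subclasses datetime.datetime
db["test"].insert_many(records)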
I have given a 50pt bounty, which expired with no answer. Hmm...
Upvotes: 10
Views: 7482
Reputation: 382
You get an error because you try to insert documents with fields that conflict with those of existing documents on the second and subsequent insert_many calls. You correctly inferred it may be due to your setting _id explicitly, which then conflicts with existing _id values in the collection. MongoDB automatically creates a unique index on _id, which forbids duplicate values.
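You can reproduce the failure with a minimal sketch (assuming a local MongoDB instance; database, collection and document contents here are made up):

import pymongo

client = pymongo.MongoClient()  # assumed local instance
col = client["testdb"]["demo"]

col.insert_many([{"_id": "a", "v": 1}])
try:
    col.insert_many([{"_id": "a", "v": 2}])  # same _id: violates the unique index
except pymongo.errors.BulkWriteError as e:
    # error code 11000 is "duplicate key"
    print(e.details["writeErrors"][0]["code"])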
You need to update or replace your documents on calls after the first one (which inserted the documents in their first version). There is indeed an "upsert" concept, which takes care of inserting documents that do not yet exist in the collection as well as updating the existing ones.
Your options:
Most efficient: pymongo.collection.Collection.bulk_write
import json
import pymongo

# One ReplaceOne per document, matched on _id; upsert=True inserts
# documents that do not exist yet
operations = [
    pymongo.operations.ReplaceOne(
        filter={"_id": doc["_id"]},
        replacement=doc,
        upsert=True,
    )
    for doc in json.loads(dfOut)
]
result = db["test"].bulk_write(operations)
# handle results
Note that its efficiency also depends on whether the filter field is indexed in the collection, which is incidentally the case for _id. (Also see pymongo.operations.ReplaceOne.)
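For example, one possible way to "handle results": the BulkWriteResult returned by bulk_write exposes counters you can inspect:

# result is the BulkWriteResult returned by bulk_write above
print(result.upserted_count)   # documents inserted via upsert
print(result.matched_count)    # documents that matched a filter
print(result.modified_count)   # documents actually changed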
Loop over your documents and call pymongo.collection.Collection.update_one or pymongo.collection.Collection.replace_one (inefficient because not bulk):
import json
import pymongo

results = []
# json.loads, not json.load: dfOut is a JSON string, not a file object
for doc in json.loads(dfOut):
    result = db["test"].replace_one(
        filter={"_id": doc["_id"]},
        replacement=doc,
        upsert=True,
    )
    results.append(result)
# handle results
Note: pymongo.collection.Collection.update_many seems unsuitable for your needs, since you are not trying to set the same value on all matches of a given filter.
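To illustrate the difference, update_many applies one identical modification to every matching document, e.g. (the filter and field values here are made up):

# Sets the same value on all documents matching the filter
db["test"].update_many(
    {"qid": "q-123"},               # hypothetical filter
    {"$set": {"processed": True}},  # identical change for every match
)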
Upvotes: 8
Reputation: 99
The batch op error may be caused by duplicate _id values, so delete the documents with the same _id already in Mongo before inserting.
Or use update_many: https://api.mongodb.com/python/current/api/pymongo/collection.html?highlight=update#pymongo.collection.Collection.update_many
https://docs.mongodb.com/manual/reference/method/db.collection.updateMany/
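A minimal sketch of the delete-before-insert approach, assuming docs is the list produced by json.loads(dfOut) in the question:

# Remove any existing documents sharing an _id, then insert fresh copies
ids = [doc["_id"] for doc in docs]
db["test"].delete_many({"_id": {"$in": ids}})
db["test"].insert_many(docs)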
Upvotes: 2