Reputation: 2318
Using SQLAlchemy with a large dataset, I would like to insert all rows using something efficient like session.add_all() followed by session.commit(). I am looking for a way to skip inserting any rows that raise duplicate / unique key errors. The problem is that these errors only surface on the session.commit() call, so there is no way to fail that specific row and move on to the next.
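For concreteness, the pattern I mean looks roughly like this (User, engine, and emails are placeholders for a mapped model with a unique email column, an engine, and the input data):

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()

# One duplicate anywhere makes the whole commit() fail
session.add_all([User(email=e) for e in emails])
session.commit()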
The closest question I have seen is here: SQLAlchemy - bulk insert ignore: "Duplicate entry". However, the accepted answer proposes abandoning the bulk method and committing after every single row insert, which is extremely slow and causes huge amounts of I/O, so I am looking for a better solution.
Upvotes: 6
Views: 4712
Reputation: 19
Indeed, same issue here. Performance seems to have been overlooked, and especially with a remote DB this is a problem.
What I always do is code around it in Python using a dictionary or list. The trick with a dictionary is to set both key and value to the same key data, e.g. (EmailAddress below stands in for your mapped model):
from sqlalchemy.orm import sessionmaker

# Deduplicate in Python before handing anything to the session.
# The dictionary acts as a "seen" set: key and value are the same data.
myEmailAddressesDict = {}
myEmailList = []
for emailAddress in allEmailAddresses:
    if emailAddress not in myEmailAddressesDict:
        # not seen before, safe to add
        myEmailList.append(EmailAddress(address=emailAddress))  # EmailAddress: your mapped model
        myEmailAddressesDict[emailAddress] = emailAddress

Session = sessionmaker(bind=self.engine)  # sessionmaker returns a factory...
mySession = Session()                     # ...which you call to get a session
try:
    mySession.add_all(myEmailList)
    mySession.commit()
except Exception as e:
    print("Add exception: ", str(e))
    mySession.rollback()
finally:
    mySession.close()
It's not a fix to the actual problem, but a workaround for the moment. The catch is that this only works if the table starts empty or you have cleared it first: the in-memory dictionary only knows about the keys seen in this run, so if the DB already contains rows, the insert can still fail on duplicates.
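To extend the workaround to a non-empty table, you could pre-seed the dictionary with the keys already in the DB (a sketch, reusing the hypothetical EmailAddress model from above):

# Pre-load existing keys so the dedup check also covers rows already in the DB
for (existingAddress,) in mySession.query(EmailAddress.address):
    myEmailAddressesDict[existingAddress] = existingAddress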
For this we really need something like a parameter in SQLAlchemy to ignore dupes on add_all, or a merge_all.
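For what it's worth, on some backends the Core insert construct can already express this at the database level; a sketch for PostgreSQL using the dialect's ON CONFLICT support (again assuming the hypothetical EmailAddress model):

from sqlalchemy.dialects.postgresql import insert

# INSERT ... ON CONFLICT DO NOTHING: the database itself skips duplicates
stmt = insert(EmailAddress.__table__).values(
    [{"address": a} for a in allEmailAddresses]
).on_conflict_do_nothing(index_elements=["address"])
with engine.begin() as conn:
    conn.execute(stmt)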
Upvotes: 1