Reputation: 9234
First off, this is my first project using SQLAlchemy, so I'm still fairly new.
I am making a system to work with GTFS data. I have a back end that seems to be able to query the data quite efficiently.
What I am trying to do though is allow for the GTFS files to update the database with new data. The problem that I am hitting is pretty obvious, if the data I'm trying to insert is already in the database, we have a conflict on the uniqueness of the primary keys.
For Efficiency reasons, I decided to use the following code for insertions, where model is the model object I would like to insert the data into, and data is a precomputed, cleaned list of dictionaries to insert.
for chunk in [data[i:i+chunk_size] for i in xrange(0, len(data), chunk_size)]:
engine.execute(model.__table__.insert(),chunk)
There are two solutions that come to mind.
I find a way to do the insert, such that if there is a collision, we don't care, and don't fail. I believe that the code above is using the TableClause, so I checked there first, hoping to find a suitable replacement, or flag, with no luck.
Before we perform the cleaning of the data, we get the list of primary key values, and if a given element matches on the primary keys, we skip cleaning and inserting the value. I found that I was able to get the PrimaryKeyConstraint from Table.primary_key, but I can't seem to get the Columns out, or find a way to query for only specific columns (in my case, the Primary Keys).
Either should be sufficient, if I can find a way to do it.
After looking into both of these for the last few hours, I can't seem to find either. I was hoping that someone might have done this previously, and point me in the right direction.
Thanks in advance for your help!
Update 1: There is a 3rd option I failed to mention above. That is to purge all the data from the database, and reinsert it. I would prefer not to do this, as even with small GTFS files, there are easily hundreds of thousands of elements to insert, and this seems to take about half an hour to perform, which means if this makes it to production, lots of downtime for updates.
Upvotes: 0
Views: 1160
Reputation: 1121614
With SQLAlchemy, you simply create a new instance of the model class, and merge it into the current session. SQLAlchemy will detect if it already knows about this object (from cache or the database) and will add a new row to the database if needed.
newentry = model(chunk)
session.merge(newentry)
Also see this question for context: Fastest way to insert object if it doesn't exist with SQLAlchemy
Upvotes: 1