Using `sqlalchemy` `insert` to bulk insert related objects

Question

I'm using sqlalchemy to insert data into a table. The things I care about:

I expect to be inserting a lot of data so want minimize the number of database execution.
There will be duplicate entries, this is expected, so I'll use on_conflict_do_nothing (I'm using postgres).

As a minimal example, let's say I have two tables:

from sqlalchemy import Column, Integer, ForeignKey, String

Base = declarative_base()


class Address(Base):
    __tablename__ = "addresses"
    id = Column(Integer, primary_key=True)
    street = Column(String)
    number = Column(Integer)

class Person(Base)
    __tablename__ = "people"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    address_id = Column(Integer, ForeignKey("addresses.id")
    address = relationship("Address")

I can batch up all the Address inserts easily enough with:

from sqlalchemy.dialects.postgresql import insert
values = [{"street": "main", "number": 1}, {"street": "main", "number": 2}]
statement = insert(Address).values(values).on_conflict_do_nothing()
session.execute(statement)

The question is how do I then do the same with the People? I'm struggling with what to supply for "address", e.g.

values = [{"name": "John", "address": ????}]
statement = insert(Person).values(values).on_conflict_do_nothing()
session.execute(statement)

I'm assuming I need the Address object but then I don't know where to go from there?

Ruben Helsloot · Accepted Answer

Don't use address, since it's not an SQL field. You can use INSERT...RETURNING to get the IDs of the addresses you just inserted, and then set values[0]['address_id'] based on that. Something like this should work:

from sqlalchemy.dialects.postgresql import insert
values = [{"street": "main", "number": 1}, {"street": "main", "number": 2}]
statement = insert(Address)\
    .values(values)\
    .returning(Address.id)\
    .on_conflict_do_nothing()
address_ids = [address_id for address_id, in session.execute(statement)]

values = [{"name": "John", "address_id": address_ids[0]}]
statement = insert(Person).values(values).on_conflict_do_nothing()
session.execute(statement)

Explanation

With insert().on_conflict_do_nothing, you're getting relatively close to the bare metal of PostgreSQL, compared to the added layers of SQLAlchemy. This is not a bad thing, it is of course where you have the most control and where the greatest speed ups happen, but it does mean that you need to think carefully about which fields are added by SQLAlchemy, and which fields are directly reflected by the database.

In your case, only id, name, and address_id are actual database columns. If you execute SELECT * FROM people, they are the only columns that should show up. In contrast, address is added by SQLAlchemy to give you a direct link to the Address instance with id address_id, but under water, a different call to the database is performed to fetch it, if necessary.

If you have an Address with id 1, then setting john.address = 1 doesn't work, because 1 is not an address, it's a number. However, setting john.address_id = 1, does work, because you're setting a field directly. SQLAlchemy discourages this if you're using the full ORM (which you're not), because you do this outside of their ecosystem. If you run john.address = Address.query.get(1), you assign the entire Address object to address, and SQLAlchemy updates address_id to 1 internally.

Using `sqlalchemy` `insert` to bulk insert related objects

Answers (1)

Related Questions