SQLAlchemy: Many to many cascade example from docs is super slow

Question

Disclaimer: I'm a database noob and this is my first question on StackOverflow. Happy to edit to make it as clear and up to code as possible.

In my application I have two tables: InputData (external data downloaded by my app) and OutputData (data created by my app). One or more InputData can be used to create one or more OutputData and when an InputData item is deleted, I want to delete all OutputData that was created using the deleted item as an input. Essentially a many-to-many association with cascade deletions.

Luckily for me, or so I thought, there is an exact example of what I want to do in the SQLAlchemy docs. However, after implementing this and working with it a bit, I've found it to be remarkably slow.

I did some benchmarking and found that 100,000 InputData, each with a single OutputData child (for a total of 100,000 OutputData) took almost 10 minutes to delete. In contrast, a one-to-many model with cascade deletes I have takes only 30 seconds to delete tables orders of magnitudes larger.

After a good deal of research, I have really only two ideas:

I am being extremely naive about the cost of many-to-many cascades and it actually makes sense that it should take 10 minutes to delete 200,000 rows across 2 tables.
When a InputData item is deleted, the ENTIRE OutputData table is being iterated across to check which OutputData rows referenced the deleted item as their parent. This certainly makes sense in my gut, especially because I've read things about Postgres not automatically creating indexes on foreign keys, but I can't find any examples of people running into the same issue or how to fix it.
I'm deleting the wrong way. The above benchmarks were simply

db.session.query(InputData).delete()
db.session.commit()

My tables (exactly what is from the docs with different names):

association_table = Table('association', Base.metadata,
    Column('input_data_id', Integer, ForeignKey('input_data.id', ondelete="CASCADE")),
    Column('output_data_id', Integer, ForeignKey('output_data.id', ondelete="CASCADE"))
)

class InputData(Base):
    __tablename__ = 'input_data'
    id = Column(Integer, primary_key=True)
    children = relationship(
        "OutputData",
        secondary=association_table,
        back_populates="inputs",
        cascade="all, delete",
    )

class OutputData(Base):
    __tablename__ = 'output_data'
    id = Column(Integer, primary_key=True)
    parents = relationship(
        "InputData",
        secondary=association_table,
        back_populates="outputs",
        passive_deletes=True
    )

Thank you in advance!

SQLAlchemy: Many to many cascade example from docs is super slow

Answers (1)

Related Questions