SQLAlchemy: Many to many cascade example from docs is super slow

Disclaimer: I'm a database noob and this is my first question on StackOverflow. Happy to edit to make it as clear and up to code as possible.

In my application I have two tables: InputData (external data downloaded by my app) and OutputData (data created by my app). One or more InputData can be used to create one or more OutputData and when an InputData item is deleted, I want to delete all OutputData that was created using the deleted item as an input. Essentially a many-to-many association with cascade deletions.

Luckily for me, or so I thought, there is an exact example of what I want to do in the SQLAlchemy docs. However, after implementing this and working with it a bit, I've found it to be remarkably slow.

I did some benchmarking and found that 100,000 InputData, each with a single OutputData child (for a total of 100,000 OutputData) took almost 10 minutes to delete. In contrast, a one-to-many model with cascade deletes I have takes only 30 seconds to delete tables orders of magnitudes larger.

After a good deal of research, I have really only two ideas:

  1. I am being extremely naive about the cost of many-to-many cascades and it actually makes sense that it should take 10 minutes to delete 200,000 rows across 2 tables.
  2. When a InputData item is deleted, the ENTIRE OutputData table is being iterated across to check which OutputData rows referenced the deleted item as their parent. This certainly makes sense in my gut, especially because I've read things about Postgres not automatically creating indexes on foreign keys, but I can't find any examples of people running into the same issue or how to fix it.
  3. I'm deleting the wrong way. The above benchmarks were simply
db.session.query(InputData).delete()
db.session.commit()

My tables (exactly what is from the docs with different names):

association_table = Table('association', Base.metadata,
    Column('input_data_id', Integer, ForeignKey('input_data.id', ondelete="CASCADE")),
    Column('output_data_id', Integer, ForeignKey('output_data.id', ondelete="CASCADE"))
)

class InputData(Base):
    __tablename__ = 'input_data'
    id = Column(Integer, primary_key=True)
    children = relationship(
        "OutputData",
        secondary=association_table,
        back_populates="inputs",
        cascade="all, delete",
    )

class OutputData(Base):
    __tablename__ = 'output_data'
    id = Column(Integer, primary_key=True)
    parents = relationship(
        "InputData",
        secondary=association_table,
        back_populates="outputs",
        passive_deletes=True
    )

Thank you in advance!

Upvotes: 0

Views: 304

Answers (1)

eshirvana
eshirvana

Reputation: 24613

well, delete cascade by nature is per record, which makes delete operation super slow (or update)

whereas you can have separate delete statements to delete from tables per set which is way faster than cascade delete.

so while cascade delete is more convenient for deleting one record for not a lot of records at the same time, manual deletion will be better approach, if you are deleting a set of records.

either way , adding index on foreign key will increase the performance of deletion.

Upvotes: 1

Related Questions