Reputation: 67
The following PostgreSQL query
UPDATE table_A A
SET is_active = false
FROM table_A
WHERE A.parent_id IS NULL AND A.is_active = true AND A.id = ANY
(SELECT (B.parent_id)
FROM table_A B
INNER JOIN table_B ON table_A.foreign_id = table_B.id
WHERE table_B.deleted = true);
gets stuck loading endlessly. I know correlated sub-queries can take long, but a SELECT using the same parameters worked quickly and returned the desired results. And I have a small data set that I let run for an entire day just to make sure it wouldn't eventually work with time.
Table_A uses hierarchical data structures and only a specific level of hierarchy has a foreign key that I can use to join and check the second table. The idea is :
Find all rows in Table_A whose associated Table_B row has its "deleted" value set to true.
From this set of results get the parent_id column
For any row in table_A whose id is part of the parent_id column, so for all parents, check if their is_active is true and if so make it false.
The EXPLAIN:
Update on table_A A (cost=0.00..3906658758867.89 rows=89947680 width=192) -> Nested Loop (cost=0.00..3906658758867.89 rows=89947680 width=192)
Join Filter: (SubPlan 1)
-> Seq Scan on table_A (cost=0.00..37899.20 rows=410720 width=14)
-> Materialize (cost=0.00..37901.39 rows=438 width=185)
-> Seq Scan on table_A A (cost=0.00..37899.20 rows=438 width=185)
Filter: ((parent_id IS NULL) AND is_active)
SubPlan 1
-> Nested Loop (cost=0.00..42405.74 rows=410720 width=8)
-> Seq Scan on table_B (cost=0.00..399.34 rows=1 width=0)
Filter: (deleted AND (table_A.foreign_id = id))
-> Seq Scan on table_A B (cost=0.00..37899.20 rows=410720 width=8)
JIT: Functions: 17 " Options: Inlining true, Optimization true, Expressions true, Deforming true"
Upvotes: 2
Views: 1616
Reputation: 29677
Sometimes proper use of aliases can make a difference.
Compare the following 2 query plans.
The first is the original query run on sample data.
cost=0.00..314144.00 rows=4975
Estimated to update 4975 rows on a table of less than 10 rows?
The second is a slightly modified version of the first.
cost=92.26..122.20 rows=2
EXPLAIN UPDATE table_A A SET is_active = false FROM table_A WHERE A.parent_id IS NULL AND A.is_active = true AND A.id = ANY ( SELECT (B.parent_id) FROM table_A B INNER JOIN table_B ON table_A.foreign_id = table_B.id WHERE table_B.deleted = true );
| QUERY PLAN | | :--------------------------------------------------------------------------------------------- | | Update on table_a a (cost=0.00..314144.00 rows=4975 width=25) | | -> Nested Loop (cost=0.00..314144.00 rows=4975 width=25) | | Join Filter: (SubPlan 1) | | -> Seq Scan on table_a (cost=0.00..29.90 rows=1990 width=10) | | -> Materialize (cost=0.00..29.93 rows=5 width=18) | | -> Seq Scan on table_a a (cost=0.00..29.90 rows=5 width=18) | | Filter: ((parent_id IS NULL) AND is_active) | | SubPlan 1 | | -> Nested Loop (cost=0.15..57.97 rows=1990 width=4) | | -> Index Scan using table_b_pkey on table_b (cost=0.15..8.17 rows=1 width=0) | | Index Cond: (table_a.foreign_id = id) | | Filter: deleted | | -> Seq Scan on table_a b (cost=0.00..29.90 rows=1990 width=4) |
EXPLAIN UPDATE table_A SET is_active = false WHERE parent_id IS NULL AND is_active = true AND id = ANY ( SELECT a2.parent_id FROM table_A a2 JOIN table_B b ON a2.foreign_id = b.id WHERE b.deleted = true );
| QUERY PLAN | | :----------------------------------------------------------------------------------------------- | | Update on table_a (cost=92.26..122.20 rows=2 width=31) | | -> Hash Join (cost=92.26..122.20 rows=2 width=31) | | Hash Cond: (table_a.id = a2.parent_id) | | -> Seq Scan on table_a (cost=0.00..29.90 rows=5 width=18) | | Filter: ((parent_id IS NULL) AND is_active) | | -> Hash (cost=89.76..89.76 rows=200 width=16) | | -> HashAggregate (cost=87.76..89.76 rows=200 width=16) | | Group Key: a2.parent_id | | -> Hash Join (cost=50.14..85.27 rows=995 width=16) | | Hash Cond: (a2.foreign_id = b.id) | | -> Seq Scan on table_a a2 (cost=0.00..29.90 rows=1990 width=14) | | -> Hash (cost=34.70..34.70 rows=1235 width=10) | | -> Seq Scan on table_b b (cost=0.00..34.70 rows=1235 width=10) | | Filter: deleted |
The second query only uses a few more alias names.
The update statement can also be written as a join on a sub-query.
UPDATE table_A AS parent SET is_active = false FROM ( SELECT child.parent_id FROM table_A AS child JOIN table_B AS dream ON child.foreign_id = dream.id WHERE child.parent_id IS NOT NULL AND dream.deleted = true GROUP BY child.parent_id ) dreamless WHERE parent.id = dreamless.parent_id AND parent.parent_id IS NULL AND parent.is_active = true;
1 rows affected
SELECT * FROM table_A
id | parent_id | is_active | foreign_id -: | --------: | :-------- | ---------: 2 | 1 | t | 2 3 | 1 | t | 5 4 | null | t | 3 5 | 3 | t | 4 1 | null | f | 1
Test on db<>fiddle here
Upvotes: 2
Reputation: 108841
I think this fiddle fairly represents your situation.
Two different things could be going on to make this run super-slow.
First, it looks like your update query does a full table scan of a 90-megarow table. That's a lot of data.
Creating an index on table_A might help expedite the finding of eligible rows in table_A.
CREATE INDEX "active_parentnull_id"
ON table_A USING BTREE
("is_active", ("parent_id" IS NULL), "id");
Similarly, creating an index on TABLE_B might help there.
CREATE INDEX "deleted_id"
ON table_B USING BTREE
("deleted", "id");
Second, it's possible you're updating a vast number of rows. UPDATE operations on a great many rows can take an absurdly long time because of transaction semantics: the RDBMS does its best to make it look, to other users of your data, like your updates happened instanteously. To pull that off for many rows takes lots of IO and CPU.
So, you should try running the update in batches. Refactor your query like this and use a LIMIT clause.
UPDATE table_A
SET is_active = false
WHERE id IN (
SELECT DISTINCT id
FROM table_A A
WHERE A.parent_id IS NULL
AND A.is_active = true
AND A.id = ANY (
SELECT (B.parent_id)
FROM table_A B
INNER JOIN table_B ON table_A.foreign_id = table_B.id
WHERE table_B.deleted = true)
LIMIT 1000);
Then run the query repeatedly until it updates no rows. It may take a while, but definitely will take less time than trying to do everything at once.
Upvotes: 1