Eric Ericson
Eric Ericson

Reputation: 67

Why does this SQL query get stuck in an endless loop?

The following PostgreSQL query

UPDATE table_A A 
SET is_active = false 
FROM table_A 
WHERE A.parent_id IS NULL AND A.is_active = true AND A.id = ANY 
(SELECT (B.parent_id) 
FROM table_A B 
INNER JOIN table_B ON table_A.foreign_id = table_B.id 
WHERE table_B.deleted = true);

gets stuck loading endlessly. I know correlated sub-queries can take long, but a SELECT using the same parameters worked quickly and returned the desired results. And I have a small data set that I let run for an entire day just to make sure it wouldn't eventually work with time.

Table_A uses hierarchical data structures and only a specific level of hierarchy has a foreign key that I can use to join and check the second table. The idea is :

  1. Find all rows in Table_A whose associated Table_B row has its "deleted" value set to true.

  2. From this set of results get the parent_id column

  3. For any row in table_A whose id is part of the parent_id column, so for all parents, check if their is_active is true and if so make it false.

The EXPLAIN:

Update on table_A A  (cost=0.00..3906658758867.89 rows=89947680 width=192)  ->  Nested Loop  (cost=0.00..3906658758867.89 rows=89947680 width=192)
    Join Filter: (SubPlan 1)
    ->  Seq Scan on table_A  (cost=0.00..37899.20 rows=410720 width=14)
    ->  Materialize  (cost=0.00..37901.39 rows=438 width=185)
          ->  Seq Scan on table_A A  (cost=0.00..37899.20 rows=438 width=185)
                Filter: ((parent_id IS NULL) AND is_active)
    SubPlan 1
      ->  Nested Loop  (cost=0.00..42405.74 rows=410720 width=8)
            ->  Seq Scan on table_B  (cost=0.00..399.34 rows=1 width=0)
                  Filter: (deleted AND (table_A.foreign_id = id))
            ->  Seq Scan on table_A B  (cost=0.00..37899.20 rows=410720 width=8) 
JIT:  Functions: 17 "  Options: Inlining true, Optimization true, Expressions true, Deforming true"

Upvotes: 2

Views: 1616

Answers (2)

LukStorms
LukStorms

Reputation: 29677

Sometimes proper use of aliases can make a difference.

Compare the following 2 query plans.
The first is the original query run on sample data.
cost=0.00..314144.00 rows=4975 Estimated to update 4975 rows on a table of less than 10 rows?

The second is a slightly modified version of the first.
cost=92.26..122.20 rows=2

EXPLAIN
UPDATE table_A A SET is_active = false 
FROM table_A 
WHERE A.parent_id IS NULL 
  AND A.is_active = true 
  AND A.id = ANY (
    SELECT (B.parent_id) 
    FROM table_A B 
    INNER JOIN table_B ON table_A.foreign_id = table_B.id 
    WHERE table_B.deleted = true
);
| QUERY PLAN                                                                                     |
| :--------------------------------------------------------------------------------------------- |
| Update on table_a a  (cost=0.00..314144.00 rows=4975 width=25)                                 |
|   ->  Nested Loop  (cost=0.00..314144.00 rows=4975 width=25)                                   |
|         Join Filter: (SubPlan 1)                                                               |
|         ->  Seq Scan on table_a  (cost=0.00..29.90 rows=1990 width=10)                         |
|         ->  Materialize  (cost=0.00..29.93 rows=5 width=18)                                    |
|               ->  Seq Scan on table_a a  (cost=0.00..29.90 rows=5 width=18)                    |
|                     Filter: ((parent_id IS NULL) AND is_active)                                |
|         SubPlan 1                                                                              |
|           ->  Nested Loop  (cost=0.15..57.97 rows=1990 width=4)                                |
|                 ->  Index Scan using table_b_pkey on table_b  (cost=0.15..8.17 rows=1 width=0) |
|                       Index Cond: (table_a.foreign_id = id)                                    |
|                       Filter: deleted                                                          |
|                 ->  Seq Scan on table_a b  (cost=0.00..29.90 rows=1990 width=4)                |
EXPLAIN 
UPDATE table_A SET is_active = false 
WHERE parent_id IS NULL 
  AND is_active = true 
  AND id = ANY (
    SELECT a2.parent_id
    FROM table_A a2 
    JOIN table_B b ON a2.foreign_id = b.id 
    WHERE b.deleted = true
  );
| QUERY PLAN                                                                                       |
| :----------------------------------------------------------------------------------------------- |
| Update on table_a  (cost=92.26..122.20 rows=2 width=31)                                          |
|   ->  Hash Join  (cost=92.26..122.20 rows=2 width=31)                                            |
|         Hash Cond: (table_a.id = a2.parent_id)                                                   |
|         ->  Seq Scan on table_a  (cost=0.00..29.90 rows=5 width=18)                              |
|               Filter: ((parent_id IS NULL) AND is_active)                                        |
|         ->  Hash  (cost=89.76..89.76 rows=200 width=16)                                          |
|               ->  HashAggregate  (cost=87.76..89.76 rows=200 width=16)                           |
|                     Group Key: a2.parent_id                                                      |
|                     ->  Hash Join  (cost=50.14..85.27 rows=995 width=16)                         |
|                           Hash Cond: (a2.foreign_id = b.id)                                      |
|                           ->  Seq Scan on table_a a2  (cost=0.00..29.90 rows=1990 width=14)      |
|                           ->  Hash  (cost=34.70..34.70 rows=1235 width=10)                       |
|                                 ->  Seq Scan on table_b b  (cost=0.00..34.70 rows=1235 width=10) |
|                                       Filter: deleted                                            |

The second query only uses a few more alias names.

The update statement can also be written as a join on a sub-query.

UPDATE table_A AS parent
   SET is_active = false 
FROM (
   SELECT child.parent_id
   FROM table_A AS child 
   JOIN table_B AS dream
     ON child.foreign_id = dream.id
   WHERE child.parent_id IS NOT NULL
     AND dream.deleted = true
   GROUP BY child.parent_id
) dreamless 
WHERE parent.id = dreamless.parent_id
  AND parent.parent_id IS NULL
  AND parent.is_active = true;
1 rows affected
SELECT * FROM table_A
id | parent_id | is_active | foreign_id
-: | --------: | :-------- | ---------:
 2 |         1 | t         |          2
 3 |         1 | t         |          5
 4 |      null | t         |          3
 5 |         3 | t         |          4
 1 |      null | f         |          1

Test on db<>fiddle here

Upvotes: 2

O. Jones
O. Jones

Reputation: 108841

I think this fiddle fairly represents your situation.

Two different things could be going on to make this run super-slow.

First, it looks like your update query does a full table scan of a 90-megarow table. That's a lot of data.

Creating an index on table_A might help expedite the finding of eligible rows in table_A.

CREATE INDEX "active_parentnull_id" 
          ON table_A USING BTREE
             ("is_active", ("parent_id" IS NULL), "id");

Similarly, creating an index on TABLE_B might help there.

CREATE INDEX "deleted_id"
          ON table_B USING BTREE
          ("deleted", "id"); 

Second, it's possible you're updating a vast number of rows. UPDATE operations on a great many rows can take an absurdly long time because of transaction semantics: the RDBMS does its best to make it look, to other users of your data, like your updates happened instanteously. To pull that off for many rows takes lots of IO and CPU.

So, you should try running the update in batches. Refactor your query like this and use a LIMIT clause.

UPDATE table_A
   SET is_active = false   
 WHERE id IN (
       SELECT DISTINCT id
         FROM table_A A
        WHERE A.parent_id IS NULL
          AND A.is_active = true
          AND A.id = ANY (
             SELECT (B.parent_id) 
               FROM table_A B 
               INNER JOIN table_B ON table_A.foreign_id = table_B.id 
               WHERE table_B.deleted = true)
         LIMIT 1000);

Then run the query repeatedly until it updates no rows. It may take a while, but definitely will take less time than trying to do everything at once.

Upvotes: 1

Related Questions