gwydion93

Reputation: 1923

How to Increase Query Speed of Large Tables in Postgres

I have 2 tables I am using in a query: tableA and tableB. TableA has a row count of 80568353, and tableB is at 101806780. They both share the same 10 columns (tableB has an additional 4 columns) including p_id. My goal is to have a query that creates a new table from tableA where the p_id does not match the p_id in tableB. Here's my query:

CREATE TABLE my_schema.tableC AS
    SELECT * FROM my_schema.tableA
    WHERE p_id NOT IN (SELECT DISTINCT p_id::int FROM my_schema.tableB)

This ran overnight and never completed, so I ran EXPLAIN to see why. Here are the results:

 Gather  (cost=20252382.21..3330358576969.77 rows=40283932 width=89)
   Workers Planned: 7
   ->  Parallel Seq Scan on tableA  (cost=20251382.21..3330354547576.57 rows=5754847 width=89)
    Filter: (NOT (SubPlan 1))
    SubPlan 1
      ->  Materialize  (cost=20251382.21..20814813.52 rows=6107669 width=8)
            ->  Unique  (cost=20251382.21..20760416.17 rows=6107669 width=8)
                  ->  Sort  (cost=20251382.21..20505899.19 rows=101806792 width=8)
                        Sort Key: tableB.p_id
                        ->  Seq Scan on tableB  (cost=0.00..5079937.92 rows=101806792 width=8)

I'm new to PostgreSQL, but cost=20251382.21..3330354547576.57 seems extraordinarily high. I tried to go back and run the following to create an index to speed things up:

CREATE INDEX IF NOT EXISTS a_p_int_idx ON my_schema.tableA (CAST(p_id as int));
CREATE INDEX IF NOT EXISTS b_p_int_idx ON my_schema.tableB (CAST(p_id as int));

This didn't seem to help. I also ran VACUUM ANALYZE on both tables, with no improvement. What am I doing wrong here? Is there any way to speed this up, or an alternative method of achieving my goal? Note: tableA's p_id is an int and tableB's p_id is double precision. Would that affect it?
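The type mismatch does matter: the per-row cast means the comparison can't use a plain index on tableB's native column. If every p_id in tableB is actually a whole number, one way to sidestep the cast entirely (a sketch against the table names above, not something attempted in the original post) is to convert the column type once:

```sql
-- Sketch: align tableB's p_id type with tableA's so queries need no
-- per-row cast. Assumes every p_id in tableB is a whole number that
-- fits in an int; the USING clause rewrites the entire table, so this
-- will take a while on ~100 million rows and takes a heavy lock.
ALTER TABLE my_schema.tableB
    ALTER COLUMN p_id TYPE int USING p_id::int;
```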

UPDATE: I was not able to get EXPLAIN ANALYZE to complete, but I was able to run EXPLAIN VERBOSE:

Gather  (cost=20506959.93..3330368701110.81 rows=40283932 width=89)
Output: tableA.geoid, tableA.p_id, tableA.land_use_t, tableA.prop_ind_t, tableA.story_nbr, tableA.bld_units, tableA.censuspop, tableA.hu_pop, tableA.point_x, tableA.point_y
Workers Planned: 7
->  Parallel Seq Scan on my_schema.tableA  (cost=20505959.93..3330364671717.61 rows=5754847 width=89)
    Output: tableA.geoid, tableA.p_id, tableA.land_use_t, tableA.prop_ind_t, tableA.story_nbr, tableA.bld_units, tableA.censuspop, tableA.hu_pop, tableA.point_x, tableA.point_y
    Filter: (NOT (SubPlan 1))
    SubPlan 1
      ->  Materialize  (cost=20505959.93..21069392.95 rows=6107669 width=4)
            Output: ((tableB.p_id)::integer)
            ->  Unique  (cost=20505959.93..21014995.61 rows=6107669 width=4)
                  Output: ((tableB.p_id)::integer)
                  ->  Sort  (cost=20505959.93..20760477.77 rows=101807136 width=4)
                        Output: ((tableB.p_id)::integer)
                        Sort Key: ((tableB.p_id)::integer)
                        ->  Seq Scan on my_schema.tableB  (cost=0.00..5334459.20 rows=101807136 width=4)
                              Output: (tableB.p_id)::integer

I also tried

EXPLAIN VERBOSE
SELECT * FROM my_schema.tableA a
WHERE NOT EXISTS (
    SELECT p_id FROM my_schema.tableB
    WHERE p_id::int = a.p_id
);

to test the NOT EXISTS suggestions, but the cost came out the same (no real improvement).

Upvotes: 1

Views: 703

Answers (1)

Ramin Faracov
Ramin Faracov

Reputation: 3303

In many databases, NOT IN and NOT EXISTS subqueries perform poorly. You can rewrite this query using a left join instead, which often gives much better performance. For example:

SELECT aa.*
FROM my_schema.tableA aa
LEFT JOIN my_schema.tableB bb ON aa.p_id = bb.p_id::int
WHERE
    bb.p_id IS NULL -- here you must test one of the NOT NULL fields of tableB (for example the join column or its primary key)

You should also create a plain B-tree index on the p_id column of both tables.
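For the join above, those index statements might look like this (a sketch; the index names are made up, and the expression index on tableB covers the p_id::int cast, much like the b_p_int_idx the asker already created):

```sql
-- Plain B-tree index on tableA's native int column:
CREATE INDEX IF NOT EXISTS a_p_idx ON my_schema.tableA (p_id);

-- Expression index on tableB so a join on p_id::int can use it
-- (tableB's p_id is double precision; note the double parentheses
-- required for an expression index):
CREATE INDEX IF NOT EXISTS b_p_idx ON my_schema.tableB ((p_id::int));
```

Note that for an anti-join across tables this large, the planner may still prefer a hash join over the indexes; the indexes mainly help if the planner chooses a merge or nested-loop strategy.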

Upvotes: 2

Related Questions