Reputation: 1923
I have 2 tables I am using in a query: tableA and tableB. TableA has a row count of 80568353, and tableB is at 101806780. They share the same 10 columns, including p_id (tableB has 4 additional columns). My goal is to have a query that creates a new table from tableA where the p_id does not match any p_id in tableB. Here's my query:
CREATE TABLE my_schema.tableC AS
SELECT * FROM my_schema.tableA
WHERE p_id NOT IN (SELECT DISTINCT p_id::int FROM my_schema.tableB)
This ran overnight and never completed. I went and ran EXPLAIN
to see why. Here are the results:
Gather (cost=20252382.21..3330358576969.77 rows=40283932 width=89)
Workers Planned: 7
-> Parallel Seq Scan on tableA (cost=20251382.21..3330354547576.57 rows=5754847 width=89)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=20251382.21..20814813.52 rows=6107669 width=8)
-> Unique (cost=20251382.21..20760416.17 rows=6107669 width=8)
-> Sort (cost=20251382.21..20505899.19 rows=101806792 width=8)
Sort Key: tableB.p_id
-> Seq Scan on tableB (cost=0.00..5079937.92 rows=101806792 width=8)
I'm new to PostgreSQL, but cost=20251382.21..3330354547576.57 seems extraordinarily high. I tried to go back and run the following to create an index to speed things up:
CREATE INDEX IF NOT EXISTS a_p_int_idx ON my_schema.tableA (CAST(p_id as int));
CREATE INDEX IF NOT EXISTS b_p_int_idx ON my_schema.tableB (CAST(p_id as int));
This didn't seem to help. I also ran VACUUM ANALYZE
on both tables. No improvement. What am I doing wrong here? Is there any way to speed this up and get it to run, or an alternative method of achieving my goal? Note: tableA's p_id is an int and tableB's p_id is double precision. Would that affect it?
UPDATE: I was not able to get EXPLAIN ANALYZE
to complete but I was able to run EXPLAIN VERBOSE
Gather (cost=20506959.93..3330368701110.81 rows=40283932 width=89)
Output: tableA.geoid, tableA.p_id, tableA.land_use_t, tableA.prop_ind_t, tableA.story_nbr, tableA.bld_units, tableA.censuspop, tableA.hu_pop, tableA.point_x, tableA.point_y
Workers Planned: 7
-> Parallel Seq Scan on my_schema.tableA (cost=20505959.93..3330364671717.61 rows=5754847 width=89)
Output: tableA.geoid, tableA.p_id, tableA.land_use_t, tableA.prop_ind_t, tableA.story_nbr, tableA.bld_units, tableA.censuspop, tableA.hu_pop, tableA.point_x, tableA.point_y
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=20505959.93..21069392.95 rows=6107669 width=4)
Output: ((tableB.p_id)::integer)
-> Unique (cost=20505959.93..21014995.61 rows=6107669 width=4)
Output: ((tableB.p_id)::integer)
-> Sort (cost=20505959.93..20760477.77 rows=101807136 width=4)
Output: ((tableB.p_id)::integer)
Sort Key: ((tableB.p_id)::integer)
-> Seq Scan on my_schema.tableB (cost=0.00..5334459.20 rows=101807136 width=4)
Output: (tableB.p_id)::integer
I also tried
EXPLAIN VERBOSE
SELECT * FROM my_schema.tableA a
WHERE NOT EXISTS (
    SELECT p_id FROM my_schema.tableB
    WHERE p_id::int = a.p_id
)
to test the NOT EXISTS
suggestions but the cost came out the same (no real improvement).
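For reference, NOT IN and NOT EXISTS are not just performance variants; they differ semantically when the subquery can produce a NULL (a real risk with a double precision column). A minimal in-memory SQLite sketch with made-up data illustrates the pitfall:

```python
import sqlite3

# Hypothetical miniature versions of tableA / tableB.
# A single NULL in the subquery makes NOT IN return no rows at all,
# while NOT EXISTS behaves as expected.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tableA (p_id INTEGER);
    CREATE TABLE tableB (p_id REAL);
    INSERT INTO tableA VALUES (1), (2), (3);
    INSERT INTO tableB VALUES (1.0), (NULL);   -- note the NULL
""")

not_in = con.execute(
    "SELECT p_id FROM tableA "
    "WHERE p_id NOT IN (SELECT CAST(p_id AS INTEGER) FROM tableB) "
    "ORDER BY p_id"
).fetchall()

not_exists = con.execute(
    "SELECT p_id FROM tableA a "
    "WHERE NOT EXISTS (SELECT 1 FROM tableB b "
    "                  WHERE CAST(b.p_id AS INTEGER) = a.p_id) "
    "ORDER BY a.p_id"
).fetchall()

print(not_in)      # [] -- the NULL "poisons" NOT IN
print(not_exists)  # [(2,), (3,)]
```

So even aside from cost, NOT EXISTS is the safer form if tableB.p_id can ever be NULL.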
Upvotes: 1
Views: 703
Reputation: 3303
In many databases, NOT IN
(and sometimes NOT EXISTS
) performs badly. You can rewrite this query as a left join
anti-join and get much better performance. For example:
SELECT * FROM my_schema.tableA aa
LEFT JOIN my_schema.tableB bb ON aa.p_id = bb.p_id
WHERE
bb.id IS NULL -- here you must test a NOT NULL column of tableB (for example its primary key)
And you should create a plain btree index
on the p_id
column in both tables.
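A minimal runnable sketch of that anti-join pattern, using an in-memory SQLite database with made-up data (in PostgreSQL you would likely join through a cast, e.g. aa.p_id = bb.p_id::int, since the two columns have different types):

```python
import sqlite3

# Hypothetical miniature tables: keep the rows of tableA whose p_id
# has no match in tableB, via LEFT JOIN ... WHERE <not-null col> IS NULL.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tableA (p_id INTEGER, val TEXT);
    CREATE TABLE tableB (id INTEGER PRIMARY KEY, p_id REAL);
    INSERT INTO tableA VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO tableB VALUES (10, 1.0), (11, 3.0);
""")

rows = con.execute("""
    SELECT aa.p_id, aa.val
    FROM tableA aa
    LEFT JOIN tableB bb ON aa.p_id = CAST(bb.p_id AS INTEGER)
    WHERE bb.id IS NULL        -- bb.id is NOT NULL in tableB,
    ORDER BY aa.p_id           -- so NULL here means "no match"
""").fetchall()

print(rows)  # [(2, 'b')] -- the only p_id absent from tableB
```

The reason the test must be on a NOT NULL column of tableB is that the left join fills all of bb's columns with NULL exactly when there is no matching row.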
Upvotes: 2