Reputation: 1923
I have 2 tables I am using in a query: tableA and tableB. TableA has a row count of 80568353, and tableB is at 101806780. They share the same 10 columns, including p_id (tableB has 4 additional columns). My goal is to have a query that creates a new table from tableA where the p_id does not match any p_id in tableB. Here's my query:
CREATE TABLE my_schema.tableC AS
SELECT * FROM my_schema.tableA
WHERE p_id NOT IN (SELECT DISTINCT p_id::int FROM my_schema.tableB)
This ran overnight and never completed. I went and ran EXPLAIN
to see why. Here are the results:
Gather (cost=20252382.21..3330358576969.77 rows=40283932 width=89)
Workers Planned: 7
-> Parallel Seq Scan on tableA (cost=20251382.21..3330354547576.57 rows=5754847 width=89)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=20251382.21..20814813.52 rows=6107669 width=8)
-> Unique (cost=20251382.21..20760416.17 rows=6107669 width=8)
-> Sort (cost=20251382.21..20505899.19 rows=101806792 width=8)
Sort Key: tableB.p_id
-> Seq Scan on tableB (cost=0.00..5079937.92 rows=101806792 width=8)
I'm new to PostgreSQL, but cost=20251382.21..3330354547576.57 seems extraordinarily high. I tried to go back and run the following to create an index to speed things up:
CREATE INDEX IF NOT EXISTS a_p_int_idx ON my_schema.tableA (CAST(p_id as int));
CREATE INDEX IF NOT EXISTS b_p_int_idx ON my_schema.tableB (CAST(p_id as int));
This didn't seem to help. I also ran VACUUM ANALYZE
on both tables. No improvement. What am I doing wrong here? Is there any way to speed this up and get it to run, or an alternative method of achieving my goal? Note: tableA's p_id is an int and tableB's p_id is double precision. Would that affect it?
UPDATE: I was not able to get EXPLAIN ANALYZE
to complete but I was able to run EXPLAIN VERBOSE
Gather (cost=20506959.93..3330368701110.81 rows=40283932 width=89)
Output: tableA.geoid, tableA.p_id, tableA.land_use_t, tableA.prop_ind_t, tableA.story_nbr, tableA.bld_units, tableA.censuspop, tableA.hu_pop, tableA.point_x, tableA.point_y
Workers Planned: 7
-> Parallel Seq Scan on my_schema.tableA (cost=20505959.93..3330364671717.61 rows=5754847 width=89)
Output: tableA.geoid, tableA.p_id, tableA.land_use_t, tableA.prop_ind_t, tableA.story_nbr, tableA.bld_units, tableA.censuspop, tableA.hu_pop, tableA.point_x, tableA.point_y
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=20505959.93..21069392.95 rows=6107669 width=4)
Output: ((tableB.p_id)::integer)
-> Unique (cost=20505959.93..21014995.61 rows=6107669 width=4)
Output: ((tableB.p_id)::integer)
-> Sort (cost=20505959.93..20760477.77 rows=101807136 width=4)
Output: ((tableB.p_id)::integer)
Sort Key: ((tableB.p_id)::integer)
-> Seq Scan on my_schema.tableB (cost=0.00..5334459.20 rows=101807136 width=4)
Output: (tableB.p_id)::integer
I also tried
EXPLAIN VERBOSE
SELECT * FROM my_schema.tableA a
WHERE NOT EXISTS (
    SELECT p_id FROM my_schema.tableB
    WHERE p_id::int = a.p_id
)
to test the NOT EXISTS
suggestions but the cost came out the same (no real improvement).
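For reference, NOT IN and NOT EXISTS are not just performance variants; they differ semantically when the subquery can produce a NULL (a real risk with a double precision column). A minimal in-memory SQLite sketch with made-up data illustrates the pitfall:

```python
import sqlite3

# Hypothetical miniature versions of tableA / tableB.
# A single NULL in the subquery makes NOT IN return no rows at all,
# while NOT EXISTS behaves as expected.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tableA (p_id INTEGER);
    CREATE TABLE tableB (p_id REAL);
    INSERT INTO tableA VALUES (1), (2), (3);
    INSERT INTO tableB VALUES (1.0), (NULL);   -- note the NULL
""")

not_in = con.execute(
    "SELECT p_id FROM tableA "
    "WHERE p_id NOT IN (SELECT CAST(p_id AS INTEGER) FROM tableB) "
    "ORDER BY p_id"
).fetchall()

not_exists = con.execute(
    "SELECT p_id FROM tableA a "
    "WHERE NOT EXISTS (SELECT 1 FROM tableB b "
    "                  WHERE CAST(b.p_id AS INTEGER) = a.p_id) "
    "ORDER BY a.p_id"
).fetchall()

print(not_in)      # [] -- the NULL "poisons" NOT IN
print(not_exists)  # [(2,), (3,)]
```

So even aside from cost, NOT EXISTS is the safer form if tableB.p_id can ever be NULL.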
Upvotes: 1
Views: 703
Reputation: 3303
In many databases, NOT IN
(and sometimes NOT EXISTS
) performs badly. You can rewrite this query as a left join
anti-join and get much better performance. For example:
SELECT * FROM my_schema.tableA aa
LEFT JOIN my_schema.tableB bb ON aa.p_id = bb.p_id
WHERE
bb.id IS NULL -- here you must test a NOT NULL column of tableB (for example its primary key)
And you should create a plain btree index
on the p_id
column in both tables.
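A minimal runnable sketch of that anti-join pattern, using an in-memory SQLite database with made-up data (in PostgreSQL you would likely join through a cast, e.g. aa.p_id = bb.p_id::int, since the two columns have different types):

```python
import sqlite3

# Hypothetical miniature tables: keep the rows of tableA whose p_id
# has no match in tableB, via LEFT JOIN ... WHERE <not-null col> IS NULL.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tableA (p_id INTEGER, val TEXT);
    CREATE TABLE tableB (id INTEGER PRIMARY KEY, p_id REAL);
    INSERT INTO tableA VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO tableB VALUES (10, 1.0), (11, 3.0);
""")

rows = con.execute("""
    SELECT aa.p_id, aa.val
    FROM tableA aa
    LEFT JOIN tableB bb ON aa.p_id = CAST(bb.p_id AS INTEGER)
    WHERE bb.id IS NULL        -- bb.id is NOT NULL in tableB,
    ORDER BY aa.p_id           -- so NULL here means "no match"
""").fetchall()

print(rows)  # [(2, 'b')] -- the only p_id absent from tableB
```

The reason the test must be on a NOT NULL column of tableB is that the left join fills all of bb's columns with NULL exactly when there is no matching row.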
Upvotes: 2