Reputation: 35149
Trying to compare 2 queries to understand which one would be faster. Basic idea is to have 1 table ( say test1
) that has no duplicates. Then you trying to insert only delta from second table (say test2
) in such a way that if second table has duplicates, then only 1 copy of the record is inserted.
create table test1 (id varchar(10), a bigint, b bigint);
create table test2 (id varchar(10), a bigint, b bigint);
insert into test1 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test2 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);
INSERT INTO test2
SELECT DISTINCT id,
a,
b
FROM test1
WHERE NOT EXISTS (SELECT *
FROM test2
WHERE test2.id = test1.id);
INSERT INTO test2
SELECT id,
a,
b
FROM (SELECT t2.*
FROM (SELECT Row_number() OVER(partition BY id) AS dup_id,
*
FROM test1) t2
WHERE t2.dup_id = 1) t1
WHERE t1.id NOT IN (SELECT test2.id
FROM test2);
Can someone help me understand which one would be faster and more efficient?
db=# explain insert into test2 select distinct id, a, b from test1 where not exists (select * from test2 where test2.id=test1.id); QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
XN Subquery Scan "*SELECT*" (cost=3613333.97..4213334.30 rows=7 width=49)
-> XN Unique (cost=3613333.97..4213334.23 rows=7 width=49)
-> XN Hash Left Join DS_BCAST_INNER (cost=3613333.97..4213334.18 rows=7 width=49)
Hash Cond: ("outer".oid = "inner".oid)
Filter: ("inner".oid IS NULL)
-> XN Seq Scan on test1 (cost=0.00..0.07 rows=7 width=53)
-> XN Hash (cost=3613333.96..3613333.96 rows=5 width=4)
-> XN Subquery Scan volt_dt_1 (cost=1760000.36..3613333.96 rows=5 width=4)
-> XN Unique (cost=1760000.36..3613333.91 rows=5 width=4)
-> XN Hash Join DS_DIST_BOTH (cost=1760000.36..3613333.90 rows=5 width=4)
Outer Dist Key: test1.id
Inner Dist Key: volt_dt_2.id
Hash Cond: (("outer".id)::text = ("inner".id)::text)
-> XN Seq Scan on test1 (cost=0.00..0.07 rows=7 width=37)
-> XN Hash (cost=1760000.34..1760000.34 rows=5 width=33)
-> XN Subquery Scan volt_dt_2 (cost=1760000.29..1760000.34 rows=5 width=33)
-> XN HashAggregate (cost=1760000.29..1760000.29 rows=5 width=33)
-> XN Hash Join DS_DIST_BOTH (cost=0.06..1760000.27 rows=5 width=33)
Outer Dist Key: test1.id
Inner Dist Key: test2.id
Hash Cond: (("outer".id)::text = ("inner".id)::text)
-> XN Seq Scan on test1 (cost=0.00..0.07 rows=7 width=33)
-> XN Hash (cost=0.05..0.05 rows=5 width=33)
-> XN Seq Scan on test2 (cost=0.00..0.05 rows=5 width=33)
----- Tables missing statistics: test2, test1 -----
----- Update statistics by running the ANALYZE command on these tables -----
(26 rows)
Explain of the second query
db=# explain insert into test2 select id, a, b from (select t2.* from ( select row_number() over(partition by id order by id) as dup_id, * from test1 ) t2 where t2.dup_id = 1 ) t1 where t1.id not in (select test2.id from test2);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
XN Hash NOT IN Join DS_DIST_INNER (cost=1000000000000.23..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=49)
Inner Dist Key: db.test2.id
Hash Cond: (("outer".id)::text = ("inner".id)::text)
-> XN Subquery Scan t2 (cost=1000000000000.17..1000000000000.36 rows=1 width=49)
Filter: (dup_id = 1)
-> XN Window (cost=1000000000000.17..1000000000000.27 rows=7 width=49)
Partition: id
Order: id
-> XN Sort (cost=1000000000000.17..1000000000000.19 rows=7 width=49)
Sort Key: id
-> XN Network (cost=0.00..0.07 rows=7 width=49)
Distribute
-> XN Seq Scan on test1 (cost=0.00..0.07 rows=7 width=49)
-> XN Hash (cost=0.05..0.05 rows=5 width=33)
-> XN Seq Scan on test2 (cost=0.00..0.05 rows=5 width=33)
----- Tables missing statistics: test2, test1 -----
----- Update statistics by running the ANALYZE command on these tables -----
(17 rows)
Upvotes: 0
Views: 210
Reputation: 1158
As everyone knows INSERT is always expensive comparing to SELECT, UPDATE and DELETE since the INSERT doesn't have WHERE clauses and also more expensive when we have more indexes in a table.
Still both the query is doing an INSERT. So, my above point is not valid.
INSERT INTO test2
SELECT DISTINCT id,
a,
b
FROM test1
WHERE id NOT IN (SELECT id FROM test2);
As Everyone knows, SELECT * FROM Table is expensive, better to use required attributes.
Upvotes: 1
Reputation: 1270713
I think first should be faster, although it wants an index on test2(id)
.
Normally, the answer to such questions is "try it on your data and your system . . . and let us know". However, the row_number()
requires a full scan of table1
. You might as well do the index lookup in table2
at the same time -- which is the first version.
Upvotes: 2