Reputation: 35149

which of 2 queries has better prformance?

Trying to compare 2 queries to understand which one would be faster. Basic idea is to have 1 table ( say test1) that has no duplicates. Then you trying to insert only delta from second table (say test2) in such a way that if second table has duplicates, then only 1 copy of the record is inserted.

Prep:

create table test1  (id varchar(10), a bigint, b bigint);
create table test2  (id varchar(10), a bigint, b bigint);
insert into test1 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test1 values ('bbb', 2, 1), ('bb2', 2, 2);
insert into test2 values ('aaa', 1, 1), ('aa2', 1, 2), ('aa3', 1, 3);

Query 1:

INSERT INTO test2
SELECT DISTINCT id,
                a,
                b
FROM   test1
WHERE  NOT EXISTS (SELECT *
                   FROM   test2
                   WHERE  test2.id = test1.id);

Query 2:

INSERT INTO test2
    SELECT id,
           a,
           b
    FROM   (SELECT t2.*
            FROM   (SELECT Row_number() OVER(partition BY id) AS dup_id,
                           *
                    FROM   test1) t2
            WHERE  t2.dup_id = 1) t1
    WHERE  t1.id NOT IN (SELECT test2.id
                         FROM   test2);

Can someone help me understand which one would be faster and more efficient?

Update

Explain of the 1-st query:

db=# explain insert into test2 select distinct id, a, b from test1 where not exists (select * from test2 where test2.id=test1.id);                                                            QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan "*SELECT*"  (cost=3613333.97..4213334.30 rows=7 width=49)
   ->  XN Unique  (cost=3613333.97..4213334.23 rows=7 width=49)
         ->  XN Hash Left Join DS_BCAST_INNER  (cost=3613333.97..4213334.18 rows=7 width=49)
               Hash Cond: ("outer".oid = "inner".oid)
               Filter: ("inner".oid IS NULL)
               ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=53)
               ->  XN Hash  (cost=3613333.96..3613333.96 rows=5 width=4)
                     ->  XN Subquery Scan volt_dt_1  (cost=1760000.36..3613333.96 rows=5 width=4)
                           ->  XN Unique  (cost=1760000.36..3613333.91 rows=5 width=4)
                                 ->  XN Hash Join DS_DIST_BOTH  (cost=1760000.36..3613333.90 rows=5 width=4)
                                       Outer Dist Key: test1.id
                                       Inner Dist Key: volt_dt_2.id
                                       Hash Cond: (("outer".id)::text = ("inner".id)::text)
                                       ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=37)
                                       ->  XN Hash  (cost=1760000.34..1760000.34 rows=5 width=33)
                                             ->  XN Subquery Scan volt_dt_2  (cost=1760000.29..1760000.34 rows=5 width=33)
                                                   ->  XN HashAggregate  (cost=1760000.29..1760000.29 rows=5 width=33)
                                                         ->  XN Hash Join DS_DIST_BOTH  (cost=0.06..1760000.27 rows=5 width=33)
                                                               Outer Dist Key: test1.id
                                                               Inner Dist Key: test2.id
                                                               Hash Cond: (("outer".id)::text = ("inner".id)::text)
                                                               ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=33)
                                                               ->  XN Hash  (cost=0.05..0.05 rows=5 width=33)
                                                                     ->  XN Seq Scan on test2  (cost=0.00..0.05 rows=5 width=33)
 ----- Tables missing statistics: test2, test1 -----
 ----- Update statistics by running the ANALYZE command on these tables -----
(26 rows)

Explain of the second query

db=# explain insert into test2 select id, a, b from (select t2.* from ( select row_number() over(partition by id order by id) as dup_id, * from test1 ) t2 where t2.dup_id = 1 ) t1 where t1.id not in (select test2.id from test2); 
                                                                                     QUERY PLAN                                                                                     
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Hash NOT IN Join DS_DIST_INNER  (cost=1000000000000.23..999999999999999967336168804116691273849533185806555472917961779471295845921727862608739868455469056.00 rows=1 width=49)
   Inner Dist Key: db.test2.id
   Hash Cond: (("outer".id)::text = ("inner".id)::text)
   ->  XN Subquery Scan t2  (cost=1000000000000.17..1000000000000.36 rows=1 width=49)
         Filter: (dup_id = 1)
         ->  XN Window  (cost=1000000000000.17..1000000000000.27 rows=7 width=49)
               Partition: id
               Order: id
               ->  XN Sort  (cost=1000000000000.17..1000000000000.19 rows=7 width=49)
                     Sort Key: id
                     ->  XN Network  (cost=0.00..0.07 rows=7 width=49)
                           Distribute
                           ->  XN Seq Scan on test1  (cost=0.00..0.07 rows=7 width=49)
   ->  XN Hash  (cost=0.05..0.05 rows=5 width=33)
         ->  XN Seq Scan on test2  (cost=0.00..0.05 rows=5 width=33)
 ----- Tables missing statistics: test2, test1 -----
 ----- Update statistics by running the ANALYZE command on these tables -----
(17 rows)

Upvotes: 0

Answers (2)

Muthaiah PL

Reputation: 1158

As everyone knows INSERT is always expensive comparing to SELECT, UPDATE and DELETE since the INSERT doesn't have WHERE clauses and also more expensive when we have more indexes in a table.

Still both the query is doing an INSERT. So, my above point is not valid.

In your first query, DISTINCT definitely needs a Table scan, and in the WHERE clause instead of NOT EXISTS, you can check the ID with the NOT IN operator

INSERT INTO test2
SELECT DISTINCT id,
a,
b
FROM test1
WHERE id NOT IN (SELECT id FROM test2);

As Everyone knows, SELECT * FROM Table is expensive, better to use required attributes.

In your second query, As Gordon said, ROW_NUMBER needs a table scan too and you are using lots of derived tables in that query, so i guess the first query is optimized than the second.

Upvotes: 1

Gordon Linoff

Reputation: 1270713

I think first should be faster, although it wants an index on test2(id).

Normally, the answer to such questions is "try it on your data and your system . . . and let us know". However, the row_number() requires a full scan of table1. You might as well do the index lookup in table2 at the same time -- which is the first version.

Upvotes: 2