jzadeh
jzadeh

Reputation: 703

Optimizing an SQL join on large tables

I am running a collection of sql query's against a large table (7,000,000 new rows/day) on a PostgreSQL database and have been running into some performance issues with first views and now creating tables. Most of the commands I am using are similar to the following query:

CREATE TABLE events_tb AS

SELECT *

FROM
(SELECT column1, column2, column3, column4, column5
 FROM test_database_1
 WHERE column6 = 'value1'
   AND date_column > '2012-07-01'
   AND date_column < '2012-07-10'
) a

INNER JOIN ( SELECT DISTINCT column1 FROM test_database_2) b
        ON (a.column1 = b.column1);

Is there a way to improve the above collection of statements to account for the fact that the query is being run against very large tables?

Upvotes: 1

Views: 11715

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 658767

This should be simpler and faster:

CREATE TABLE events_tb AS
SELECT column1, column2, column3, column4, column5
FROM   test_database_1 t1
WHERE  column6 = 'value1'
AND    date_column > '2012-07-01'
AND    date_column < '2012-07-10'
AND    EXISTS (
    SELECT 1
    FROM   test_database_2 t2
    WHERE  t2.column1 = t1.column1
    );

The way you had it would include column1 twice in the newly created table, which would result in an error message.

An EXISTS semi-join should be faster than a JOIN or IN expression, because it can stop execution at the first find. This is especially beneficial with duplicates - which you seem to have, judging from the DISTINCT in your query.

Upvotes: 5

biziclop
biziclop

Reputation: 14616

I know nothing about PostgreSQL's optimizer, but you could try replacing the INNER JOIN with an IN (SELECT...) structure:

CREATE TABLE events_tb AS

SELECT column1, column2, column3, column4, column5

FROM test_database_1

WHERE column6 = 'value1'
  AND date_column > '2012-07-01'
  AND date_column < '2012-07-10'
  AND column1 IN ( SELECT DISTINCT column1 FROM test_database_2 )

Upvotes: 2

Related Questions