Optimizing an SQL join on large tables

Question

I am running a collection of sql query's against a large table (7,000,000 new rows/day) on a PostgreSQL database and have been running into some performance issues with first views and now creating tables. Most of the commands I am using are similar to the following query:

CREATE TABLE events_tb AS

SELECT *

FROM
(SELECT column1, column2, column3, column4, column5
 FROM test_database_1
 WHERE column6 = 'value1'
   AND date_column > '2012-07-01'
   AND date_column < '2012-07-10'
) a

INNER JOIN ( SELECT DISTINCT column1 FROM test_database_2) b
        ON (a.column1 = b.column1);

Is there a way to improve the above collection of statements to account for the fact that the query is being run against very large tables?

Erwin Brandstetter · Accepted Answer

This should be simpler and faster:

CREATE TABLE events_tb AS
SELECT column1, column2, column3, column4, column5
FROM   test_database_1 t1
WHERE  column6 = 'value1'
AND    date_column > '2012-07-01'
AND    date_column < '2012-07-10'
AND    EXISTS (
    SELECT 1
    FROM   test_database_2 t2
    WHERE  t2.column1 = t1.column1
    );

The way you had it would include column1 twice in the newly created table, which would result in an error message.

An EXISTS semi-join should be faster than a JOIN or IN expression, because it can stop execution at the first find. This is especially beneficial with duplicates - which you seem to have, judging from the DISTINCT in your query.

Optimizing an SQL join on large tables

Answers (2)

Related Questions