Reputation: 703
I am running a collection of sql query's against a large table (7,000,000 new rows/day) on a PostgreSQL database and have been running into some performance issues with first views and now creating tables. Most of the commands I am using are similar to the following query:
CREATE TABLE events_tb AS
SELECT *
FROM
(SELECT column1, column2, column3, column4, column5
FROM test_database_1
WHERE column6 = 'value1'
AND date_column > '2012-07-01'
AND date_column < '2012-07-10'
) a
INNER JOIN ( SELECT DISTINCT column1 FROM test_database_2) b
ON (a.column1 = b.column1);
Is there a way to improve the above collection of statements to account for the fact that the query is being run against very large tables?
Upvotes: 1
Views: 11715
Reputation: 658767
This should be simpler and faster:
CREATE TABLE events_tb AS
SELECT column1, column2, column3, column4, column5
FROM test_database_1 t1
WHERE column6 = 'value1'
AND date_column > '2012-07-01'
AND date_column < '2012-07-10'
AND EXISTS (
SELECT 1
FROM test_database_2 t2
WHERE t2.column1 = t1.column1
);
The way you had it would include column1
twice in the newly created table, which would result in an error message.
An EXISTS
semi-join should be faster than a JOIN
or IN
expression, because it can stop execution at the first find. This is especially beneficial with duplicates - which you seem to have, judging from the DISTINCT
in your query.
Upvotes: 5
Reputation: 14616
I know nothing about PostgreSQL's optimizer, but you could try replacing the INNER JOIN
with an IN (SELECT...)
structure:
CREATE TABLE events_tb AS
SELECT column1, column2, column3, column4, column5
FROM test_database_1
WHERE column6 = 'value1'
AND date_column > '2012-07-01'
AND date_column < '2012-07-10'
AND column1 IN ( SELECT DISTINCT column1 FROM test_database_2 )
Upvotes: 2