Marco
Marco

Reputation: 33

Cartesian product on same table

I'm trying to translate a R code in SQL Query.

tab1 <- merge(tab1, tab1, by='col1', allow.cartesian = T)
tab1 <- tab1[col2.x != col2.y,]

I have to do a cartesian product on same table, based on key col1 column.

This is my situation:

col1 | col2
 a   |  b 
 a   |  c 
 a   |  d 
 c   |  e 
 c   |  f 
 g   |  i 
 h   |  j 

and I'm trying to get something like this:

col1 | col2 | col3
 a   |  b   |  c
 a   |  b   |  d
 a   |  c   |  b
 a   |  c   |  d
 a   |  d   |  b
 a   |  d   |  c
 c   |  e   |  f
 c   |  f   |  e

so I did a query like this:

SELECT a.*, b.col2 as col3
FROM tab1 a
JOIN tab1 b on 
 a.col1 = b.col1 AND 
 a.col2 <> b.col2

The query works, but it's really really slow compared to R. The result it's correct but with col3 different order:

col1 | col2 | col3
 a   |  b   |  d
 a   |  b   |  c
 a   |  c   |  d
 a   |  c   |  b
 a   |  d   |  c
 a   |  d   |  b
 c   |  e   |  f
 c   |  f   |  e

Is this the only way to do this? And also there is a way to increment the performarces? I have a table with like 2M rows, and with the cartesian I get like 20M rows.

In R it takes like 1 or 2 minutes to get the result, on PostgreSQL it takes like 10-15 minutes.

Thank you so much.

Upvotes: 0

Views: 966

Answers (2)

Marco
Marco

Reputation: 33

Following the @jlandercy and @Edouard suggestions, I solved the problem creating a MATERIALIZED VIEW and an INDEX with the columns.

CREATE MATERIALIZED VIEW tab1_mat AS
SELECT *
FROM tab1

CREATE INDEX IF NOT EXISTS col1_col2 ON tab1 USING BTREE (col1, col2) ;

After this, I'm able to do my query:

SELECT a.*, b.col2 as col3
FROM tab1_mat a
JOIN tab1_mat b on 
 a.col1 = b.col1 AND 
 a.col2 <> b.col2

Then I scheduled a job that every night updates my materialized view using this command.

REFRESH MATERIALIZED VIEW tab1_mat;

Upvotes: 0

Edouard
Edouard

Reputation: 7075

"The result it's correct but with col3 different order"

To get the right order, you have to add the ORDER BY 1,2,3 clause to your self-joined query, but this is going to make it even slower ...

"there is a way to increment the performances ?"

Yes if you don't have yet created the following index, it should dramatically accelerate your self-joined query :

CREATE INDEX IF NOT EXISTS col1_col2 ON tab1 USING BTREE (col1, col2) ;

Upvotes: 1

Related Questions