Reputation: 235
This might seem like a duplicate, but I could not find an answer that matches my requirements. I referred to the one here, but it is not quite the same and its answers do not apply to my case, hence this question.
SELECT tab1.col1, tab1.col2, tab1.col3, tab2.col3, tab2.col4
FROM
(SELECT col1, col2, col3
FROM table1
GROUP BY col1, col2, col3) AS tab1
JOIN tab2
ON tab1.col1 = tab2.col1
AND tab1.col2 = tab2.col2
GROUP BY tab1.col1, tab1.col2, tab1.col3, tab2.col3, tab2.col4
Example Data:
Table1                        Tab2
col1  | col2   | col3         col1  | col2   | col3 | col4
=========================     =============================
page1   image1   referer1     page1   image1   150    75
page1   image1   referer1     page1   image1   120    85
page2   image2   referer2     page2   image2   200    400
page1   image1   referer1     page1   image1   750    1024
page2   image2   referer2     page2   image2   450    575
page1   image1   referer1     page1   image1   600    900
Expected Output:
tab1.col1 | tab1.col2 | tab1.col3 | tab2.col3 | tab2.col4
=========================================================
page1       image1      referer1    600         900
page2       image2      referer2    200         400
Here the final GROUP BY returns unique rows, but the tab1 columns repeat, which I don't want; in this case the records vary only on tab2.col3 and tab2.col4. The requirement is that tab1.col1, tab1.col2, tab1.col3 be unique, with only one corresponding pair of tab2.col3 and tab2.col4 from tab2 for each combination. I cannot simply remove tab2.col3 and tab2.col4 from the second GROUP BY and apply a MIN or MAX aggregate to them, because then I would not get the tab2.col3 value that belongs to the same record as the tab2.col4 value.
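To illustrate the problem with independent aggregates described above, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for Redshift and loads only the Tab2 sample rows from the question; the variable names are illustrative. It shows that MIN(col3) and MIN(col4) can come from different rows, producing a pair that never existed in the data.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for Redshift here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab2 (col1 TEXT, col2 TEXT, col3 INTEGER, col4 INTEGER)"
)

# Tab2 sample rows copied from the question.
rows = [
    ("page1", "image1", 150, 75),
    ("page1", "image1", 120, 85),
    ("page2", "image2", 200, 400),
    ("page1", "image1", 750, 1024),
    ("page2", "image2", 450, 575),
    ("page1", "image1", 600, 900),
]
conn.executemany("INSERT INTO tab2 VALUES (?, ?, ?, ?)", rows)

# Each MIN() is computed independently per group, so the two results
# are not guaranteed to come from the same source row.
mixed = conn.execute(
    "SELECT col1, col2, MIN(col3), MIN(col4) "
    "FROM tab2 GROUP BY col1, col2 ORDER BY col1"
).fetchall()

existing_pairs = {(col3, col4) for _, _, col3, col4 in rows}
for col1, col2, min3, min4 in mixed:
    # The last value printed is True only if (min3, min4) is a real row's pair.
    print(col1, col2, min3, min4, (min3, min4) in existing_pairs)
```

For page1 the aggregates combine 120 (from the row 120/85) with 75 (from the row 150/75), which is why a per-row selection is needed rather than per-column aggregates.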
Note: I am using Amazon Redshift as the database. The subquery is important because it is actually a complicated result set produced by a three-table join; if I join those three tables directly with tab2, the query runs forever. For the sake of simplicity, let's assume that the subquery returns col1, col2, col3 from table1. Tab2 and tab1 are ginormous tables :D... the subquery improves performance considerably (from 20 minutes down to ~2 minutes).
Upvotes: 0
Views: 82
Reputation: 2774
If the result below is what you need,
tab1.col1 | tab1.col2 | tab1.col3 | tab2.col3 | tab2.col4
=========================================================
page1       image1      referer1    120         85
page2       image2      referer2    200         400
then you can achieve it with the following Redshift SQL query:
SELECT tab1.col1, tab1.col2, tab1.col3, tab2.col3, tab2.col4
FROM
  (SELECT col1, col2, col3
   FROM table1
   GROUP BY col1, col2, col3) AS tab1
JOIN
  (SELECT col1, col2, col3, col4
   FROM
     (SELECT col1, col2, col3, col4,
             ROW_NUMBER() OVER(PARTITION BY col1, col2 ORDER BY col3 ASC, col4 ASC) AS row_num
      FROM table2) AS ranked
   WHERE row_num = 1) AS tab2
  ON tab1.col1 = tab2.col1
 AND tab1.col2 = tab2.col2
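The key part of the SQL statement above is "ROW_NUMBER() OVER(PARTITION BY col1, col2 ORDER BY col3 ASC, col4 ASC)". Within each (col1, col2) group it numbers the rows by col3 ascending, using col4 only to break ties, so row_num = 1 is the row with the smallest tab2.col3. The tab2.col4 value comes from that same row; it is not necessarily the smallest col4 in the group. If you need the row with the largest values instead, change the order to DESC.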
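As a quick check of the ROW_NUMBER() approach against the sample data from the question, here is a minimal sketch. It assumes SQLite 3.25 or later (for window function support) through Python's sqlite3 module as a stand-in for Redshift; the helper name pick_one and the aliases ranked and picked are illustrative and not part of the original query.

```python
import sqlite3

# Assumption: in-memory SQLite (3.25+ for window functions) stands in for Redshift.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.execute(
    "CREATE TABLE table2 (col1 TEXT, col2 TEXT, col3 INTEGER, col4 INTEGER)"
)

# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [("page1", "image1", "referer1")] * 4 + [("page2", "image2", "referer2")] * 2,
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?, ?)",
    [
        ("page1", "image1", 150, 75),
        ("page1", "image1", 120, 85),
        ("page2", "image2", 200, 400),
        ("page1", "image1", 750, 1024),
        ("page2", "image2", 450, 575),
        ("page1", "image1", 600, 900),
    ],
)


def pick_one(direction):
    """Return one table2 row per (col1, col2), chosen by the sort direction."""
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    query = f"""
        SELECT tab1.col1, tab1.col2, tab1.col3, picked.col3, picked.col4
        FROM (SELECT col1, col2, col3
              FROM table1
              GROUP BY col1, col2, col3) AS tab1
        JOIN (SELECT col1, col2, col3, col4
              FROM (SELECT col1, col2, col3, col4,
                           ROW_NUMBER() OVER (
                               PARTITION BY col1, col2
                               ORDER BY col3 {direction}, col4 {direction}
                           ) AS row_num
                    FROM table2) AS ranked
              WHERE row_num = 1) AS picked
          ON tab1.col1 = picked.col1
         AND tab1.col2 = picked.col2
        ORDER BY tab1.col1
    """
    return conn.execute(query).fetchall()


for row in pick_one("ASC"):
    print(row)
for row in pick_one("DESC"):
    print(row)
```

With ASC, page1 keeps the row 120/85 and page2 keeps 200/400; with DESC, page1 keeps 750/1024 and page2 keeps 450/575. In both cases col3 and col4 come from the same source row.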
I hope this answers your question.
You can also check the result set here, which I have put together.
Upvotes: 0