Reputation: 823

SQL: Use distinct on groups of similar data

Hello Mates I have the following problem in a Vertica database: I have a large Table

+------+------+------+
| Date | Col1 | Col2 |
+------+------+------+
|    1 | A    | B    |
|    2 | A    | B    |
|    3 | D    | E    |
|    2 | C    | D    |
|    1 | C    | D    |
+------+------+------+

As you can see I have redundant data, just taken on different dates (row 1 & 2 and row 4 & 5). So I would like a table that removes that redundant data by deleting the rows with the lower date, giving me a result like that:

+------+------+------+
| Date | Col1 | Col2 |
+------+------+------+
|    2 | A    | B    |
|    2 | C    | D    |
|    3 | D    | E    |
+------+------+------+

Using distinct would not work since it will delete rows randomly not considering the date, so I might end up with a table like this:

SELECT DISTINCT Col2, Col3 from Table 


+------+------+------+
| Date | Col1 | Col2 |
+------+------+------+
|    2 | A    | B    |
|    1 | C    | D    |
|    3 | D    | E    |
+------+------+------+

which is not desired.

Is there anyway to accomplish that? Thanks mates

Upvotes: 0

Answers (3)

woot

Reputation: 7616

I'm just generalizing the patterns here and adding one, for the exact question asked any of these methods would probably work, the devil is in the details.

The aggregate method proposed by @Thomas_G works because you only have 1 column outside the grouping. If you had two it could mix/match (some data from one row, some from another) which is not likely what you want as a duplicate handling strategy.

The analytical method proposed by @Gordon_Linoff is good, but be aware that if the date is duplicated in the source data, then you'll get multiple rows if they exist on the max date. This might be what you want, but maybe not.

Another method is to just peel off the top row in the window. It will choose the first row in the partition based on your window ordering. If there are multiples dates at the max, then you can't guarantee which one will be chosen unless you include something more in the window order. But at least you know you'll only get one row, for what it's worth.

select t.*
from (select t.*, row_number() over (partition by col1, col2 order by date desc) as rn
      from t
     ) t
where rn = 1;

Upvotes: 2

Gordon Linoff

Reputation: 1269883

If there are other columns that you care about, you can use window functions:

select t.*
from (select t.*, max(date) over (partition by col1, col2) as maxd
      from t
     ) t
where date = maxd;

Upvotes: 1

Thomas G

Reputation: 10216

Do a GROUP BY on your 2 columns and aggregate on the highest date:

SELECT MAX(Date), col1, col2
FROM table
GROUP BY Col1, Col2

Upvotes: 4

SQL: Use distinct on groups of similar data

Answers (3)

Related Questions